=Paper=
{{Paper
|id=Vol-3220/talk2
|storemode=property
|title=How to structure citations data and bibliographic metadata in the OpenCitations accepted format
|pdfUrl=https://ceur-ws.org/Vol-3220/invited-talk2.pdf
|volume=Vol-3220
|authors=Arcangelo Massari,Ivan Heibi
|dblpUrl=https://dblp.org/rec/conf/jcdl/MassariH22
}}
==How to structure citations data and bibliographic metadata in the OpenCitations accepted format==
Arcangelo Massari, Ivan Heibi

Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy

JCDL’22: ULITE-ws, Understanding LIterature references in academic full TExt, June 24–06, 2022, Cologne, Germany

Email: arcangelo.massari@unibo.it (A. Massari); ivan.heibi2@unibo.it (I. Heibi)

ORCID: 0000-0002-8420-0696 (A. Massari); 0000-0001-5366-5194 (I. Heibi)

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

===Abstract===
The OpenCitations organization is working on ingesting citation data and bibliographic metadata directly provided by the community (e.g., scholars and publishers). The aim is to improve the general coverage of open citations, which is still far from being complete, and use the provided metadata to enrich the characterization of the citing and cited entities. This paper illustrates how the citation data and bibliographic metadata should be structured to comply with the OpenCitations accepted format.

Keywords: OpenCitations, Bibliographic metadata, Open citations, CSV

===1. Introduction===
The Declaration on Research Assessment [1], the Leiden Manifesto for Research Metrics [2], and the Initiative for Open Citations (I4OC, https://i4oc.org/) have successfully convinced almost all major academic publishers to release their publication reference lists. To date, more than 1.2 billion citations are available through the Crossref REST API [3] and distributed by OpenCitations [4] as structured data, separate from the original bibliographic source and under the CC0 license [5].

Nevertheless, the coverage of open citations is still far from complete [6]. On the one hand, some publishers have not yet made their citations public. On the other hand, many citations are lost because they are only present in unstructured format within PDF files, especially in the social sciences.

OpenCitations is working on ingesting citations and bibliographic metadata coming directly from the community (e.g., scholars and publishers). In this way, projects like EXCITE [7], aimed at extracting citations from PDFs, could significantly contribute to increasing the data coverage.

The following section illustrates how to structure the citation data and bibliographic metadata in the accepted format of OpenCitations. We conclude this paper with a description of planned future work.

===2. Metadata and citations===
OpenCitations manages and processes two different CSV files to separately characterize the ingested documents: one containing their metadata (META-CSV), and a second one holding their citations (CITS-CSV). In this section we discuss how these files should be structured and defined before providing them to OpenCitations. The discussion is based on more exhaustive documentation [8].

CSV files are logically structured as tables. In META-CSV each document (row) is characterised by 11 attributes (columns):

• id. The ID(s) of the corresponding document. A document can have more than one ID. Each ID is defined by its type (using an acronym) and its value, as follows: ID abbreviation + “:” + ID value. For example, “doi:10.3233/ds-170012” indicates a DOI identifier having the value “10.3233/ds-170012”. Multiple IDs must be separated by a single white space.

• title. A textual value expressing the title of the document.

• author and editor. Data regarding the authors and the editors of the document. Each agent (author or editor) is defined by several attributes, e.g., family name or ID. Multiple agents are separated by a semicolon followed by a white space character. Generally, the definition of an agent follows this structure: Family Name + “,” + “ ” + Given Name + “ ” + “[” + IDs + “]”. The IDs of the authors/editors are specified in square brackets and follow the format used for the “id” attribute, e.g. “Peroni, Silvio [orcid:0000-0003-0530-4305]”. If there are no IDs, the square brackets are omitted as well. The given name is not mandatory; however, the description of the agent should still contain a comma to indicate its absence (e.g. “Peroni, [orcid:0000-0003-0530-4305]”).

• pub_date. The date of publication of the document. The date is defined according to ISO 8601 [9], the ISO standard for “Representation of dates and times”: YYYY-MM-DD. It is mandatory to specify at least the publication year. The values of the month and day are not required. However, if the day is specified, the month must be specified as well.

• venue. Data regarding the venue of the document. For example, if the document is a journal article, the venue defines the journal where the document has been published. Each venue is described as follows: Venue Title + “ ” + “[” + IDs + “]”. The IDs of a venue are described using the same format used for the “id” attribute. If there are no identifiers, the square brackets are omitted.

• volume and issue. These values are required only if the document is contained in a journal volume or a journal issue.

• page. The page range of the corresponding document, defined by specifying the first and the last page, separated by a hyphen “-”.

• type. A textual value identifying the type of the document. This value is taken from the list of the currently supported bibliographic resource types: book, book chapter, book part, book section, book series, book set, book track, component, dataset (or data file), dissertation, edited book, journal, journal article, journal issue, journal volume, monograph, other, peer review, posted content (or web content), proceedings, proceedings article, proceedings series, reference book, reference entry, report, report series, standard, and standard series.

• publisher. The publisher name of the corresponding document. A publisher is defined using the same format used for the venue.

If the resource identifier is specified in the “id” field, all the other fields are optional. Conversely, if the “id” field is empty, there are mandatory fields that vary depending on the resource type:

• The fields “title”, “pub_date”, and “author” (or “editor”) are mandatory for resources of type book, dataset (or data file), dissertation, edited book, journal article, monograph, other, peer review, posted content (or web content), proceedings article, report, and reference book. Moreover, this information is compulsory if the “type” field is empty.

• The “title” and “venue” fields are required for resources of type book chapter, book part, book section, book track, component, and reference entry.

• Only the “title” field is required for resources of type book series, book set, journal, proceedings, proceedings series, report series, standard, and standard series.

• For resources of type journal volume, the fields “venue” and “volume”, or “venue” and “title”, are mandatory. Similarly, for resources of type journal issue, the fields “venue” and “issue”, or “venue” and “title”, are mandatory.

Table 1 shows an example of a well-formed META-CSV representation. The table contains a small sample of ten documents (rows) and their corresponding attributes (columns).

In the CITS-CSV, on the other hand, each entity (row) represents a citation. A citation is characterised by 4 attributes (columns): citing_id, citing_publication_date, cited_id, and cited_publication_date. The citing_id and cited_id values represent the identifiers of the citing and cited document, respectively. These values are both mandatory, and they are structured following the same scheme used for the id definition in META-CSV. The citing_publication_date and cited_publication_date represent the date of publication of the citing and cited document, respectively. Both these values are optional, and follow the same structural scheme used for the pub_date definition in META-CSV.

Table 2 shows an example of a well-formed CITS-CSV representation. The table contains a small sample of ten different citations (rows) and their corresponding attributes (columns).

===3. Discussion and conclusion===
This paper described how to define well-formed CSV files storing citations and metadata of bibliographic resources, ready to be provided to and later processed by OpenCitations.

The ingestion of bibliographic metadata will be possible starting from the release of OpenCitations Meta (OC-Meta), expected by the end of 2022. OC-Meta will store bibliographic metadata for the documents involved (as citing or cited entities) in OpenCitations citation indexes.

The ingestion of the citations is possible thanks to CROCI, the Crowdsourced Open Citations Index, which allows individuals identified by ORCIDs to deposit the citation data that they have the legal right to submit [10]. Citation data are submitted to either Figshare (https://figshare.com) or Zenodo (https://zenodo.org), accompanied by the ORCID of the contributor. Afterwards, the submitter can inform OpenCitations using the GitHub issue tracker on the CROCI repository (https://github.com/opencitations/croci/issues).

Future work includes implementing an interface that simplifies and automates the entire publication process via CROCI, also providing input data validation and modification suggestions.

Moreover, CROCI currently handles only DOI-to-DOI citations. The upcoming plan is to let CROCI also manage any-to-any citations.

===Acknowledgments===
This work was funded by the European Union’s Horizon 2020 research and innovation program under grant agreement No 101017452 (OpenAIRE-Nexus Project). We want to thank Silvio Peroni for supervising the entire work on OpenCitations, Philipp Mayr-Schlegel and Ahsan Shahid for the feedback on the documentation from which this demo paper is drawn, and Davide Brambilla for the valuable insights about CROCI and its future developments.

===References===
[1] R. Cagan, San Francisco declaration on research assessment, Disease Models & Mechanisms (2013) dmm.012955. URL: https://journals.biologists.com/dmm/article/doi/10.1242/dmm.012955/261854/San-Francisco-Declaration-on-Research-Assessment. doi:10.1242/dmm.012955.

[2] D. Hicks, P. Wouters, L. Waltman, S. de Rijcke, I. Rafols, Bibliometrics: The Leiden manifesto for research metrics, Nature 520 (2015) 429–431. URL: https://www.nature.com/articles/520429a. doi:10.1038/520429a.

[3] G. Hendricks, D. Tkaczyk, J. Lin, P. Feeney, Crossref: The sustainable source of community-owned scholarly metadata, Quantitative Science Studies 1 (2020) 414–427. doi:10.1162/qss_a_00022.

[4] S. Peroni, D. Shotton, OpenCitations, an infrastructure organization for open scholarship, Quantitative Science Studies 1 (2020) 428–444. doi:10.1162/qss_a_00023.

[5] S. Peroni, D. Shotton, Open citation: Definition, figshare, 2018. doi:10.6084/M9.FIGSHARE.6683855.V1.

[6] A. Martín-Martín, Coverage of open citation data approaches parity with Web of Science and Scopus, OpenCitations blog (2021).

[7] A. Hosseini, B. Ghavimi, Z. Boukhers, P. Mayr, Excite – a toolchain to extract, match and publish open literature references, in: 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), IEEE, 2019, pp. 432–433.

[8] A. Massari, How to produce well-formed CSV files for OpenCitations, 2022. URL: https://doi.org/10.5281/zenodo.6597141. doi:10.5281/zenodo.6597141.

[9] M. Wolf, C. Wicksteed, Date and time formats, https://www.w3.org/TR/NOTE-datetime, 1997.

[10] I. Heibi, S. Peroni, D. M. Shotton, Crowdsourcing open citations with CROCI - an analysis of the current status of open citations, and a proposal, CoRR abs/1902.02534 (2019). URL: http://arxiv.org/abs/1902.02534. arXiv:1902.02534.

===A. Appendix===
{| class="wikitable"
|+ Table 1: A sample of ten documents characterized by their corresponding metadata attributes
! id !! title !! author !! pub_date !! venue !! volume !! issue !! page !! type !! publisher !! editor
|-
| doi:10.1007/978-3-030-00668-6_8 || The SPAR Ontologies || Peroni, Silvio [orcid:0000-0003-0530-4305]; Shotton, David [orcid:0000-0001-5506-523X] || 2018 || 17th ISWC [doi:10.1007/978-3-030-00668-6] || || || 119-136 || book chapter || Springer International Publishing [crossref:297] ||
|-
| doi:10.3233/DS-170012 || Automating semantic publishing || Peroni, Silvio [orcid:0000-0003-0530-4305] || 2017 || Data Science [issn:2451-8484 issn:2451-8492] || 1 || 1-2 || 155-173 || journal article || IOS Press [crossref:7437] ||
|-
| doi:10.1007/978-3-476-00160-3 isbn:9783476021144 isbn:9783476001603 || Literatur || || 2005 || || || || || book || Springer Science and Business Media LLC [crossref:297] || Gfrereis, Heike
|-
| doi:10.1057/9780230316645 isbn:9780230276604 isbn:9780230316645 || New Waves in Philosophy of Law || || 2011 || || || || || book || Springer Science and Business Media LLC [crossref:297] || Mar, Maksymilian Del
|-
| doi:10.4324/9781003115830 isbn:9781003115830 || Governing Savages || Markus, Andrew || 2020-7-31 || || || || || book || Informa UK Limited [crossref:301] ||
|-
| doi:10.1515/9781503600836 isbn:9781503600836 || Newsworthy || Barbas, Samantha || 2020-6-24 || || || || || book || Walter de Gruyter GmbH [crossref:374] ||
|-
| doi:10.1134/s0018151x17020055 || On the theory of convection of electrons in metals || Gladkov, S. O. || 2017-5 || High Temperature [issn:0018-151X issn:1608-3156] || 55 || 3 || 321-325 || journal article || Pleiades Publishing Ltd [crossref:137] ||
|-
| doi:10.1134/s0018151x17050029 || Stability of boiling shock || Avdeev, A. A. || 2017-9 || High Temperature [issn:0018-151X issn:1608-3156] || 55 || 5 || 753-760 || journal article || Pleiades Publishing Ltd [crossref:137] ||
|-
| doi:10.1134/s0018151x17050224 || The high-temperature and radiative effect on concrete || Zhakin, A. I. || 2017-9 || High Temperature [issn:0018-151X issn:1608-3156] || 55 || 5 || 767-776 || journal article || Pleiades Publishing Ltd [crossref:137] ||
|-
| doi:10.1134/s0018151x18010169 || Relaxation of Rayleigh and Lorentz Gases in Shock Waves || Skrebkov, O. V. || 2018-1 || High Temperature [issn:0018-151X issn:1608-3156] || 56 || 1 || 77-83 || journal article || Pleiades Publishing Ltd [crossref:137] ||
|}

{| class="wikitable"
|+ Table 2: A sample of ten citations characterized by their related attributes
! citing_id !! citing_publication_date !! cited_id !! cited_publication_date
|-
| doi:10.1016/j.websem.2012.08.001 || 2012-12 || doi:10.1087/2009202 || 2009-04-01
|-
| doi:10.1016/j.websem.2012.08.001 || 2012-12 || doi:10.1371/journal.pcbi.1000361 ||
|-
| doi:10.1016/j.websem.2012.08.001 || 2012-12 || doi:10.1007/978-3-642-33876-2_35 || 2012
|-
| doi:10.1016/j.websem.2012.08.001 || 2012-12 || doi:10.1186/2041-1480-1-S1-S6 || 2010-06-22
|-
| doi:10.1016/j.websem.2012.08.001 || 2012-12 || doi:10.1145/945645.945664 || 2003-10-23
|-
| pmid:23636598 || 2013 || pmid:19151427 || 2005
|-
| pmid:23636598 || 2013 || pmid:19782561 || 2008-10
|-
| pmid:23636598 || || pmid:18686754 || 2012-09-05
|-
| pmid:23636598 || 2013 || pmid:15890079 || 2009-07-15
|-
| pmid:23636598 || 2013 || pmid:18191757 ||
|}
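
The field syntax described in Section 2 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not OpenCitations software: all helper names (parse_ids, parse_labelled, parse_agents, is_valid_date, check_citation_row) are hypothetical, ID abbreviations are assumed to be lower-case letters, and months and days are assumed to be zero-padded as in ISO 8601.

```python
import re

# Hypothetical helpers for illustration; none of these names come from OpenCitations.
ID_RE = re.compile(r"^(?P<scheme>[a-z]+):(?P<value>\S+)$")
# Year is mandatory; a day may only appear when a month is present.
DATE_RE = re.compile(r"^\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?$")
# "Label [ids]" where the bracketed part is optional.
BRACKET_RE = re.compile(r"^(?P<label>.*?)\s*(?:\[(?P<ids>[^\]]*)\])?$")


def parse_ids(field):
    """Split a space-separated list of 'abbreviation:value' IDs into pairs."""
    pairs = []
    for token in field.split():
        match = ID_RE.match(token)
        if match is None:
            raise ValueError(f"malformed identifier: {token!r}")
        pairs.append((match["scheme"], match["value"]))
    return pairs


def parse_labelled(entry):
    """Parse 'Label [ids]' as used for a venue, a publisher, or a single agent."""
    match = BRACKET_RE.match(entry.strip())
    return match["label"].strip(), parse_ids(match["ids"] or "")


def parse_agents(field):
    """Parse an author/editor field whose agents are separated by semicolons."""
    agents = []
    for entry in field.split(";"):
        if not entry.strip():
            continue
        name, ids = parse_labelled(entry)
        family, _, given = name.partition(",")
        agents.append({"family": family.strip(), "given": given.strip(), "ids": ids})
    return agents


def is_valid_date(value):
    """Accept YYYY, YYYY-MM or YYYY-MM-DD (shape and ranges only, no calendar check)."""
    return bool(DATE_RE.match(value))


def check_citation_row(row):
    """Return a list of problems found in one CITS-CSV row (dict keyed by column)."""
    problems = []
    for column in ("citing_id", "cited_id"):
        try:
            if not parse_ids(row.get(column, "")):
                problems.append(f"{column} is mandatory")
        except ValueError as err:
            problems.append(f"{column}: {err}")
    for column in ("citing_publication_date", "cited_publication_date"):
        value = row.get(column, "")
        if value and not is_valid_date(value):
            problems.append(f"{column}: not YYYY[-MM[-DD]]")
    return problems
```

For instance, parse_agents applied to “Peroni, [orcid:0000-0003-0530-4305]” yields one agent with an empty given name, and check_citation_row returns an empty list for a row that has two well-formed IDs and no dates. Dates without zero padding, such as 2017-5, are rejected by this sketch; a more lenient check would be an equally reasonable choice.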