OpenCitations: a short introduction

Silvio Peroni1,2

1 Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
2 Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy

Abstract
In this paper, I present a brief history of open citations, their main characteristics, and their use in the context of OpenCitations, a scholarly infrastructure organisation dedicated to open scholarship and the publication of open bibliographic and citation data using Semantic Web technologies.

Keywords
OpenCitations, open citation data, open bibliographic metadata, Semantic Web

1. The origins of open citations

The concept of open citations [1] is closely tied to that of the Web. Since 1989, the Web has drastically changed how we think about academic publishing and science. Publishers have adopted Web standards to create and deliver their products quickly and to a broader audience. Standards, guidelines, and services based on Web technologies have been proposed in the past 30 years to increase the discoverability of academic products and publications, improve research practices and allow the reuse of scholarly data in different application contexts.

Open citations are no exception. Although the definition of open citations was introduced only recently, earlier works had already implicitly highlighted their main characteristics. As far as I know, the first embryonic description of open citations is in Robert Cameron’s visionary article published in 1997 [2]. In this article, he speculated about the existence of a decentralised and freely available Universal Citation Database.
Such a database would be updated daily and link to every scholarly work, covering all types of publications (from journal articles to technical reports, datasets, and other publication types) and being equally visible and accessible to all.

From this early Web age, things started to develop. In the same year as Cameron’s article, CiteSeer was established [3], a service that crawled citations from PostScript documents available on the Web. Along the same lines, a few years later, CiteBase was created in the context of the OpCit project [4]. In 2004, Google Scholar (https://scholar.google.com) was launched to provide one of the first open Web interfaces for looking at a scientist’s paper and the citations that this paper received (even though the data are not openly accessible). A few years later, CiteSeerX [5] was proposed as an evolution of CiteSeer to address some problems of its predecessor.

ULITE-ws: Understanding Literature References in Academic Full Text at JCDL 2022
silvio.peroni@unibo.it (S. Peroni); https://essepuntato.it (S. Peroni); ORCID 0000-0003-0530-4305 (S. Peroni)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

However, the tipping point for open citations came in 2009, when David Shotton introduced the concept of semantic publishing [6], that is, the use of Semantic Web technologies in the scholarly publishing domain to make journal articles and other scholarly publications more discoverable and reusable.
This idea led him to the JISC OpenCitations project in 2010 (https://opencitations.wordpress.com/2010/07/15/jisc-open-citations-aims-objectives-and-final-outputs/), a year-long project (with a subsequent extension) that aimed at creating the first corpus of open citation data entirely available on the Web by using URLs to identify resources and RDF to expose these data to the public.

The idea of providing open citations spread through the scholarly and publishing community in the following years, in two different editions of the Annual Conference of Open Access Publishers (OASPA, https://oaspa.org/conference/). Both David Shotton’s talk at OASPA 2013 and Dario Taraborelli’s speech at OASPA 2016 highlighted the essential need to release citation data as soon as possible for the whole scholarly community.

From 2016 onwards, things began to change on a large scale. The importance of open citations reached a broader audience and led to the introduction of OpenCitations [7] and WikiCite, examples of communities providing open citation data and services for programmatic access to them. After these first technical implementations, in 2017, the Initiative for Open Citations (I4OC, https://i4oc.org) was launched to convince publishers to make their reference lists free and openly available on Crossref (https://crossref.org) [8]. In the following years, other international events and scholarly initiatives helped increase the interest in open citations and related technical infrastructures. The successful movement toward public domain citation data is now stronger than ever and can “improve the transparency and robustness of scientific portfolio analysis, improve science policy decision-making, stimulate downstream commercial activity, and increase the discoverability of scientific articles” [9].

2. What is an open citation

In the previous section, I used the concept of open citations several times.
However, I have not yet clarified what it means or which characteristics a citation must have to be called open. First, it is necessary to explain what I mean by the word citation.

A bibliographic citation is a conceptual directional link from a citing entity to a cited entity, created to acknowledge or ascribe credit for the contribution made by the authors of the cited entity. This link is defined using particular textual devices such as a bibliographic reference in the reference list, denoted by an in-text reference pointer – e.g. “[3]” or “(Doe et al., 2013)” – within the body of the citing entity. The citation data related to a particular citation must include the representation of this conceptual directional link and the basic metadata of the citing entity and the cited entity, i.e. sufficient information to create or retrieve textual bibliographic references for each of the two entities involved in the citation.

A bibliographic citation is an open citation when the data needed to define the citation comply with the following principles [10]:

Figure 1: The five principles that citation data must comply with to qualify as an open citation.

• structured – citation data must be expressed in one or more machine-readable formats such as JSON or RDF;
• separate – citation data must be available without the need to access the source bibliographic entity (e.g. the article or book) in which the citation is defined, which may even be behind a paywall;
• open – citation data must be freely accessible and reusable without restrictions, for example, by publication under the CC0 1.0 Universal waiver/license;
• identifiable and available – citing and cited entities must be identified by using a specific persistent identifier scheme (e.g. a DOI) or a URL.
In addition, by resolving the identifiers of the citing and cited entities, it must be possible to obtain the basic metadata of both entities, sufficient to create or retrieve textual bibliographic references for each of them. Such basic entity metadata must also be structured, separate and open.

These principles have been thoroughly followed in the technical developments of OpenCitations, introduced in the following section.

3. OpenCitations, a scholarly infrastructure organisation

OpenCitations (https://opencitations.net) [7], of which I am proud to be one of the directors, is a scholarly infrastructure organisation dedicated to open scholarship and the publication of open bibliographic and citation data using Semantic Web technologies. We also undertake advocacy for open scholarly metadata, mainly via the Initiative for Open Citations (I4OC, https://i4oc.org) and the Initiative for Open Abstracts (I4OA, https://i4oa.org). Our goal is to provide open metadata with a scope, depth, accuracy and provenance surpassing commercial sources.

We provide the OpenCitations Data Model [11], which we use to describe all the bibliographic metadata and citation data OpenCitations provides. Of course, we also provide bibliographic and citation data (all released using the CC0 waiver to maximise their reuse), available in different collections, including the OpenCitations Indexes, our primary collection. In addition, all the software we developed to gather and expose these data is available in our GitHub repository (https://github.com/opencitations) and released with open source licenses. Finally, all the data are available online: full dumps of OpenCitations data can be downloaded, and the data can be accessed programmatically via REST APIs, SPARQL endpoints, and other Web interfaces.

Our primary database, COCI (the OpenCitations Index of Crossref open DOI-to-DOI citations) [12], currently hosts more than 1.29 billion citations. All these citations have been made available as Linked Open Data.
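As an illustration of what structured, separate and identifiable citation data can look like in practice, the following minimal Python sketch models a DOI-to-DOI citation and builds a REST request URL for the citations received by a given DOI. The base URL, the `citations` operation name, and the `citing` and `cited` field names are assumptions that are not stated in this paper and should be checked against the current OpenCitations API documentation. The sample record is a hand-written placeholder, not real data.

```python
import json
from dataclasses import dataclass
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base URL of the COCI REST API; not given in the paper.
COCI_API = "https://opencitations.net/index/coci/api/v1"


@dataclass(frozen=True)
class Citation:
    """A directional link from a citing entity to a cited entity."""
    citing: str  # DOI of the citing entity
    cited: str   # DOI of the cited entity


def citations_url(doi: str) -> str:
    """Build the URL that lists the citations received by the given DOI."""
    # Keep the slash between DOI prefix and suffix; escape anything unsafe.
    return f"{COCI_API}/citations/{quote(doi, safe='/')}"


def parse_citations(payload: str) -> list[Citation]:
    """Convert a JSON array of citation records into Citation objects."""
    return [Citation(r["citing"], r["cited"]) for r in json.loads(payload)]


def fetch_citations(doi: str) -> list[Citation]:
    """Download and parse the citations of a DOI (needs network access)."""
    with urlopen(citations_url(doi), timeout=30) as response:
        return parse_citations(response.read().decode("utf-8"))


# Hypothetical response body with placeholder DOIs, for offline use only.
SAMPLE = '[{"citing": "10.1234/citing-article", "cited": "10.1234/cited-article"}]'

records = parse_citations(SAMPLE)
```

Calling `fetch_citations` would perform a real HTTP request, so the example only exercises the parsing step on `SAMPLE`. Keeping URL construction and parsing in separate functions lets each be checked without network access.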
These citations can be accessed programmatically using our REST API by specifying either a publication’s DOI or the Open Citation Identifier (OCI) [13] that identifies the citation as a whole, i.e. the relation “entity A cites entity B”.

Since 2020, OpenCitations has benefited significantly from (a) crowdfunding from the scholarly community, which resulted from the selection of OpenCitations by the Global Sustainability Coalition for Open Science Services (SCOSS, https://scoss.org) as a scholarly infrastructure worthy of support, and (b) its involvement in international projects, such as the OpenAIRE-Nexus project (https://www.openaire.eu/openaire-nexus-project) and the RISIS project (https://www.risis2.eu/).

OpenCitations espouses the UNESCO principles of Open Science [14], the Principles of Open Scholarly Infrastructure [15], the FAIR data principles that data should be Findable, Accessible, Interoperable, and Reusable [16], and the I4OC principles that citation data should be Structured, Separable, and Open (https://i4oc.org/#goals). In compliance with these values, one of OpenCitations’ main priorities is to keep its services, software, and data always free of charge and under open licenses (CC0 for data and ISC for software) to foster their maximum reuse.

4. Conclusions and future directions

This commitment to keeping all OpenCitations data and services free leads to an acknowledged sustainability issue, principally in terms of salaries and technical infrastructure costs. OpenCitations can rely on an international network of generous supporters who take part in its membership and donation programmes. We are grateful to the institutions that believe in our mission and values.
However, we are still far from being a fully financially sustained infrastructure, and we still need help from the global scholarly community to keep open bibliographic and citation data and related services available for many years and to reach the following goals:

• to provide high-quality metadata with full provenance relating to scholarly publications and the citations that link them, including those in areas such as the humanities and social sciences, the global south, and non-English publications;
• to expand our coverage into the ‘grey literature’ of reports, patents, datasets, software, etc.;
• to surpass the major commercial citation indexes in terms of coverage and quality, and thereby provide an open and free alternative to them;
• to provide the open data crucial for research in bibliometrics and scientometrics, and the creation of transparent and reproducible metrics for research assessment;
• to continue developing and releasing free and open-source software (FOSS) with relevant functionality, along with the open services built on our data.

We (scholars, institutions, funders, etc.) can make a difference and create an open, inclusive future for science and research. OpenCitations is a plural: together, we are OpenCitations.

Acknowledgments

I want to thank David Shotton and Chiara Di Giambattista, who provided insightful discussions, materials and feedback that have been included in this paper, and all the people working for and supporting OpenCitations. This work has been funded by the European Union’s Horizon 2020 research and innovation program under grant agreement No 101017452 (OpenAIRE-Nexus).

References

[1] D. Shotton, Open citations, Nature 502 (2013) 295–297. doi:10.1038/502295a.
[2] R. D. Cameron, A Universal Citation Database As a Catalyst For Reform In Scholarly Communication, First Monday 2 (1997). doi:10.5210/fm.v2i4.522.
[3] C. L. Giles, K. D. Bollacker, S. Lawrence, CiteSeer: an automatic citation indexing system, in: Proceedings of the third ACM conference on Digital libraries - DL ’98, ACM Press, Pittsburgh, Pennsylvania, United States, 1998, pp. 89–98. doi:10.1145/276675.276685.
[4] T. Brody, S. Harnad, L. Carr, Earlier Web usage statistics as predictors of later citation impact, Journal of the American Society for Information Science and Technology 57 (2006) 1060–1072. doi:10.1002/asi.20373.
[5] H. Li, I. Councill, W.-C. Lee, C. L. Giles, CiteSeerx: an architecture and web service design for an academic document search engine, in: Proceedings of the 15th international conference on World Wide Web - WWW ’06, ACM Press, Edinburgh, Scotland, 2006, p. 883. doi:10.1145/1135777.1135926.
[6] D. Shotton, Semantic publishing: the coming revolution in scientific journal publishing, Learned Publishing 22 (2009) 85–94. doi:10.1087/2009202.
[7] S. Peroni, D. Shotton, OpenCitations, an infrastructure organization for open scholarship, Quantitative Science Studies 1 (2020) 428–444. doi:10.1162/qss_a_00023.
[8] G. Hendricks, D. Tkaczyk, J. Lin, P. Feeney, Crossref: The sustainable source of community-owned scholarly metadata, Quantitative Science Studies 1 (2020) 414–427. doi:10.1162/qss_a_00022.
[9] B. I. Hutchins, A tipping point for open citation data, Quantitative Science Studies 2 (2021) 433–437. doi:10.1162/qss_c_00138.
[10] S. Peroni, D. Shotton, Open Citation: Definition, 2018. URL: https://doi.org/10.6084/m9.figshare.6683855, version 1.0.
[11] M. Daquino, S. Peroni, D. Shotton, G. Colavizza, B. Ghavimi, A. Lauscher, P. Mayr, M. Romanello, P. Zumstein, The OpenCitations Data Model, in: The Semantic Web – ISWC 2020, volume 12507 of Lecture Notes in Computer Science, Springer, Cham, Switzerland, 2020, pp. 447–463. doi:10.1007/978-3-030-62466-8_28.
[12] I. Heibi, S. Peroni, D. Shotton, Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, Scientometrics 121 (2019) 1213–1228. doi:10.1007/s11192-019-03217-6.
[13] S. Peroni, D. Shotton, Open Citation Identifier: Definition, 2019. URL: https://doi.org/10.6084/m9.figshare.7127816.
[14] UNESCO, UNESCO Recommendation on Open Science, Programme and meeting document SC-PCB-SPP/2021/OS/UROS, 2021. URL: https://unesdoc.unesco.org/ark:/48223/pf0000379949.
[15] G. Bilder, J. Lin, C. Neylon, The Principles of Open Scholarly Infrastructure, 2020. URL: https://doi.org/10.24343/C34W2H.
[16] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, J. Bouwman, A. J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. T. Evelo, R. Finkers, A. Gonzalez-Beltran, A. J. Gray, P. Groth, C. Goble, J. S. Grethe, J. Heringa, P. A. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S. J. Lusher, M. E. Martone, A. Mons, A. L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S.-A. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M. A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, B. Mons, The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016) 160018. doi:10.1038/sdata.2016.18.