<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Freedom for bibliographic references: OpenCitations arise</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Silvio Peroni</string-name>
          <email>silvio.peroni@unibo.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Shotton</string-name>
          <email>david.shotton@oerc.ox.ac.uk</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Vitali</string-name>
          <email>fabio.vitali@unibo.it</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Scholarly citations from one publication to another, expressed as reference lists within academic articles, are core elements of scholarly communication. Unfortunately, they usually can be accessed en masse only by paying significant subscription fees to commercial organizations, while those few services that do make them available for free impose strict limitations on their reuse. In this paper we provide an overview of the OpenCitations Project (http://opencitations.net) undertaken to remedy this situation, and of its main product, the OpenCitations Corpus, which is an open repository of accurate bibliographic citation data harvested from the scholarly literature, made available in RDF under a Creative Commons public domain dedication. RASH version: https://w3id.org/oc/paper/occ-lisc2016.html</p>
      </abstract>
      <kwd-group>
        <kwd>Citation Database</kwd>
        <kwd>OpenCitations</kwd>
        <kwd>OpenCitations Corpus</kwd>
        <kwd>Scholarly Communication</kwd>
        <kwd>Semantic Publishing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Databases of citation data are among the most attractive and used artefacts
in the Scholarly Communication domain. They are one of the main tools
used by researchers for gaining knowledge about a particular topic, and by
scientists in Bibliometrics, Informetrics, and Scientometrics for analysing
the complex relationships that exist within huge networks of citations of
scholarly works. They also serve institutional goals, since they provide one
of the main mechanisms for assessing the quality of research by means
of (sometimes questionable) metrics and indicators calculated from such
citation databases. While some of these resources, e.g. Microsoft Academic
Graph<sup>3</sup> and Google Scholar<sup>4</sup>, are freely accessible (but not downloadable),
those considered the most authoritative by institutions worldwide, namely
Scopus<sup>5</sup> and Web of Science<sup>6</sup>, can be accessed only by paying significant
access fees, which may amount to tens of thousands of pounds annually [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
3 https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/
4 https://scholar.google.com/
5 https://www.scopus.com/
6 http://webofscience.com/
      </p>
      <p>
        Reference lists within academic articles are core elements of scholarly
communication, since they both permit the attribution of credit and
integrate our independent research endeavours. But the cruel reality is that
these key data are not freely available. In the current age where Open
Access is considered a necessary practice in research, it is a scandal that
reference lists from scholarly publications (conference papers, books,
journal articles, etc.) are not readily and freely available for use by all scholars.
As we have already stated in a previous work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]:
      </p>
      <p>Citation data now needs to be recognized as a part of the Commons
– those works that are freely and legally available for sharing –
and placed in an open repository, where they should be stored in
appropriate machine-readable formats so as to be easily reused by
machines to assist people in producing novel services.</p>
      <p>
        This is the main premise behind the OpenCitations Project [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
which has created an open repository of scholarly citation data – the
OpenCitations Corpus (OCC) – made available under a Creative Commons
public domain dedication<sup>7</sup> to provide in RDF accurate citation information
(bibliographic references) harvested from the scholarly literature. Since the
beginning of July 2016, the OCC has been ingesting and processing the
reference lists of scholarly papers available in Europe PubMed Central<sup>8</sup>. In
this paper we provide a brief overview of the OCC’s main components, which
make possible the extraction and description of such reference lists in RDF,
and a progress report concerning the available citation data.
      </p>
      <p>The rest of the paper is organised as follows. In Section 2 we recall the
story of the OpenCitations Project since its beginning in 2010. In Section 3
we describe the revised metadata specification and software tools that have
been recently developed within the OpenCitations Project for the creation
of a new and improved instantiation of the OCC. In Section 4 we briefly
describe other open (and RDF-based) repositories of scholarly document
metadata. Finally, in Section 5, we sketch out our future plans.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 The story so far</title>
      <p>
        The OpenCitations Project formally started in 2010 as a one-year project
funded by JISC<sup>9</sup> (subsequently extended for an additional half year), with
David Shotton as director, who at that time was working in the
Department of Zoology at the University of Oxford. The project’s goal was global
in scope, and was designed to change the face of scientific publishing and
scholarly communication, since it aimed to publish bibliographic citation
information in RDF and to make citation links as easy to traverse as Web
links. The main deliverable of the project, among several outcomes<sup>10</sup>, was
the release of an open repository of scholarly citation data described using
the SPAR (Semantic Publishing and Referencing) Ontologies<sup>11</sup> [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], namely
the OpenCitations Corpus, initially populated with the citations from
journal articles within the Open Access Subset of PubMed Central<sup>12</sup> [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
7 https://creativecommons.org/publicdomain/zero/1.0/legalcode
8 http://europepmc.org/
9 http://www.jisc.ac.uk/whatwedo/programmes/inf11/jiscexpo/jiscopencitation.aspx
      </p>
      <p>In May 2014, OpenCitations was adopted by the Infrastructure Services
for Open Access (IS4OA)<sup>13</sup> as one of its academic Open Access services.
IS4OA is a UK-based not-for-profit charitable company that aims to provide
benefit to the global community of research information users. It acts as
an umbrella organisation that supports openly accessible information and
discovery services relating to academic information, research results and
scholarly publications, by providing business structure and expertise and a
means of channelling financial support to these services.</p>
      <p>At the end of 2015, Silvio Peroni joined the OpenCitations Project as
co-director, with the aim of setting up a new instantiation of the Corpus
based on a new metadata schema and employing several new technologies
to automate the ingestion of fresh citation metadata from authoritative
sources. The current instantiation of the OCC is hosted by the Department
of Computer Science and Engineering (DISI) at the University of Bologna,
and since the beginning of July 2016 it has been ingesting, processing and
publishing reference lists of scholarly papers available in Europe PubMed
Central, as described in the following section.
</p>
    </sec>
    <sec id="sec-3">
      <title>3 The new instantiation of the OpenCitations Corpus</title>
      <p>
        The OpenCitations Project (http://opencitations.net) has recently
created a new instantiation of its open citations database, with an integrated
SPARQL endpoint and a browsing interface to support data consumers.
This database, the OpenCitations Corpus (OCC), is an open repository of
scholarly citation data made available under a Creative Commons public
domain dedication (CC0), which provides accurate bibliographic references
harvested from the scholarly literature, described using the SPAR
Ontologies [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] according to the OCC metadata document [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], that others may
freely build upon, enhance and reuse for any purpose, without restriction
under copyright or database law.
      </p>
      <sec id="sec-3-1">
        <title>3.1 The model</title>
        <p>
          The newly revised metadata model used for the data stored in the OCC,
available at [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and briefly summarised in Fig. 1, is explicitly aligned with
the SPAR Ontologies [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and other standard vocabularies. In particular:
– the FRBR-aligned Bibliographic Ontology (FaBiO)<sup>14</sup> [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is used to
provide a description of all the metadata of citing/cited bibliographic
resources (conference papers, book chapters, journal articles, etc.) and
their related container resources (academic proceedings, books,
journals, etc.), and metadata about the particular formats in which they
have been embodied (digital vs. print, first and last pages, etc.);
– the Publishing Roles Ontology (PRO)<sup>15</sup> [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is used to describe the roles
of bibliographic agents (author, editor, publisher, etc.) related to the
bibliographic resources, while the order among such roles, e.g. the list
of authors of a paper, is handled by extending PRO with an additional
property, i.e. oco:hasNext;
– the Bibliographic Reference Ontology (BiRO)<sup>16</sup> and the Citation
Counting and Context Characterization Ontology (C4O)<sup>17</sup> [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] are used to
describe the textual content of each reference in the reference list of a
citing bibliographic resource;
– finally, the DataCite Ontology<sup>18</sup> is used to define all the identifiers
(e.g. DOI, PubMed ID, PubMed Central ID, ORCID, ISSN, etc.) for
bibliographic resources and the agents involved, while the Friend Of
A Friend (FOAF)<sup>19</sup> ontology is used to define additional data about
agents, such as their given and family names.
10 https://opencitations.wordpress.com/2011/07/01/jisc-opencitations-project-–-final-project-blog-post/
11 http://www.sparontologies.net/
12 http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
13 https://is4oa.org/services/open-citations-corpus/
        </p>
        <p>For convenience, all the terms from the aforementioned ontologies are
collected within a new ontology called the OpenCitations Ontology (OCO)<sup>20</sup>.
This is not yet another bibliographic ontology, but rather just a mechanism
for grouping existing complementary ontological entities from several other
ontologies for the purpose of providing descriptive metadata for the OCC.
</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 The data</title>
        <p>
          The OCC stores metadata relevant to bibliographic citations in RDF,
encoded as JSON-LD<sup>21</sup>. In early September 2016, all the ingested data will
also be available as downloadable datasets. In the meantime, two
exemplar datasets, compliant with the OCC metadata model introduced in
Section 3.1, have been made available: the first from article metadata provided
by Springer Nature (available at [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]), and the second gathered from
Europe PubMed Central (available at [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ])<sup>22</sup>.
        </p>
        <p>The following six bibliographic entity types occur in the OCC, as well
as in the aforementioned exemplar datasets:
14 http://purl.org/spar/fabio
15 http://purl.org/spar/pro
16 http://purl.org/spar/biro
17 http://purl.org/spar/c4o
18 http://purl.org/spar/datacite
19 http://xmlns.com/foaf/spec/
20 https://w3id.org/oc/ontology
21 http://json-ld.org/
22 All the resources in the exemplar datasets have URLs that start with
“http://localhost:8000/corpus/” and do not refer to any existing IRI included in
the current version of the corpus.</p>
        <p>– bibliographic resources (br), class fabio:Expression – resources
that either cite or are cited by other bibliographic resources (e.g. journal
articles), or that contain such citing/cited resources (e.g. journals);
– resource embodiments (re), class fabio:Manifestation – details
of the physical or digital forms in which the bibliographic resources are
made available by their publishers;
– bibliographic entries (be), class biro:BibliographicReference –
the literal textual bibliographic entries occurring in the reference lists
within the bibliographic resources, that reference other bibliographic
resources;
– responsible agents (ra), class foaf:Agent – names of agents having
certain roles with respect to the bibliographic resources (i.e. names of
authors, editors, publishers, etc.);
– agent roles (ar), class pro:RoleInTime – roles held by agents with
respect to the bibliographic resources (e.g. author, editor, publisher);
– identifiers (id), class datacite:Identifier – external identifiers
(e.g. DOI, ORCID, PubMed ID) associated with the bibliographic
entities.</p>
        <p>The corpus URL (https://w3id.org/oc/corpus/) identifies the entire
OCC, which is composed of several sub-datasets, one for each of the six
aforementioned bibliographic entities included in the corpus. Each of these
has a URL composed by suffixing the corpus URL with the two-letter short
name for the class of entity (e.g. “be” for a bibliographic entry) followed by
a slash (e.g. https://w3id.org/oc/corpus/be/). Each dataset
is described appropriately by means of the Data Catalog Vocabulary<sup>23</sup> and
the VoID Vocabulary<sup>24</sup>, and a SPARQL endpoint<sup>25</sup> is made available for
all the entities included in the entire OCC.</p>
        <p>Upon initial curation into the OCC, a URL is assigned to each
entity within each sub-dataset, which can be accessed in different formats
(HTML, RDF/XML, Turtle, and JSON-LD) via content negotiation. Each
entity URL is composed by suffixing the sub-dataset URL with a
number assigned to each resource, unique among resources of the same type,
which increments for each new entry added to that resource class. For
instance, the resource https://w3id.org/oc/corpus/be/537 is the 537th
bibliographic entry recorded within the OCC. The final part of such a URL,
i.e. the two-letter short name for the class of items plus “/” plus the number
(“be/537” in the example), is called the internal corpus identifier, since it
allows the unique identification of any entity within the OCC.</p>
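        <p>As a minimal sketch of the URL scheme just described (not part of the OCC software), the following Python fragment builds an entity URL, recovers the internal corpus identifier from it, and prepares an HTTP Accept header for content negotiation. The function names and the table of MIME types are illustrative assumptions; only the corpus URL, the two-letter short names and the four formats are taken from the text.</p>
        <preformat>
```python
import re

CORPUS_URL = "https://w3id.org/oc/corpus/"

# Two-letter short names of the six entity types listed in the text.
SHORT_NAMES = {
    "br": "bibliographic resource",
    "re": "resource embodiment",
    "be": "bibliographic entry",
    "ra": "responsible agent",
    "ar": "agent role",
    "id": "identifier",
}

# Assumption: standard MIME types for the four formats named in the text.
# The values actually honoured by the OCC server are not stated there.
ACCEPT = {
    "html": "text/html",
    "rdfxml": "application/rdf+xml",
    "turtle": "text/turtle",
    "jsonld": "application/ld+json",
}


def entity_url(short_name, number):
    """Suffix the sub-dataset URL with the per-type counter."""
    if short_name not in SHORT_NAMES:
        raise ValueError("unknown entity type: " + short_name)
    if not (isinstance(number, int) and number >= 1):
        raise ValueError("the counter must be a positive integer")
    return CORPUS_URL + short_name + "/" + str(number)


def internal_id(url):
    """Return the internal corpus identifier, e.g. 'be/537'."""
    pattern = re.escape(CORPUS_URL) + r"([a-z]{2})/([1-9][0-9]*)"
    match = re.fullmatch(pattern, url)
    if match is None or match.group(1) not in SHORT_NAMES:
        raise ValueError("not an OCC entity URL: " + url)
    return match.group(1) + "/" + match.group(2)


def accept_header(fmt):
    """Header asking for one of the four serialisations (hypothetical helper)."""
    return {"Accept": ACCEPT[fmt]}


url = entity_url("be", 537)
print(url, internal_id(url), accept_header("turtle"))
```
        </preformat>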
        <p>Each of these entities has associated metadata describing its provenance
using the PROV-O<sup>26</sup> ontology and its PROV-DC extension<sup>27</sup> (e.g.
https://w3id.org/oc/corpus/be/537/prov/se/1). In particular, we keep track
of the curatorial activities related to each OCC entity, the curatorial agents
involved, and their roles.</p>
        <p>All these RDF data are stored in BibJSON<sup>28</sup> encoded as JSON-LD,
defined through an appropriate JSON-LD context<sup>29</sup> which hides the
complexity of the model (shown in Fig. 1) behind natural language keywords.
For instance, the following excerpt is the JSON-LD linearisation of the
aforementioned “be/537” entity:
{
"iri": "gbe:537",
"a": "entry",
"label": "bibliographic entry 537 [be/537]",
"content": "Svahn, HA, Berg, A. Single cells or large populations, Lab Chip, 2007, 7, 544, 546, DOI: 10.1039/b704632b, PMID: 17476370",
"crossref": "gbr:1601"
}</p>
        <p>In this excerpt, “iri” defines the URL of the resource in consideration
(where “gbe:” is a prefix for “https://w3id.org/oc/corpus/be/”), while “a”,
“entry”, “label”, “content” and “crossref” stand for rdf:type,
biro:BibliographicReference, rdfs:label, c4o:hasContent and biro:references
respectively (where “gbr:” is a prefix for “https://w3id.org/oc/corpus/br/”).</p>
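        <p>The keyword mapping described above can be illustrated with a short Python sketch that turns the “be/537” excerpt into subject-predicate-object tuples. This is a hand-written expansion for illustration only, not a JSON-LD processor and not the OCC code: the function names are hypothetical, and the tables contain only the keywords and prefixes spelled out in the text, whereas the actual context file may define more.</p>
        <preformat>
```python
# Keyword-to-term mapping exactly as spelled out in the text.
TERMS = {
    "a": "rdf:type",
    "label": "rdfs:label",
    "content": "c4o:hasContent",
    "crossref": "biro:references",
}
CLASSES = {"entry": "biro:BibliographicReference"}
PREFIXES = {
    "gbe": "https://w3id.org/oc/corpus/be/",
    "gbr": "https://w3id.org/oc/corpus/br/",
}

RECORD = {
    "iri": "gbe:537",
    "a": "entry",
    "label": "bibliographic entry 537 [be/537]",
    "content": "Svahn, HA, Berg, A. Single cells or large populations, "
               "Lab Chip, 2007, 7, 544, 546, DOI: 10.1039/b704632b, "
               "PMID: 17476370",
    "crossref": "gbr:1601",
}


def expand_iri(value):
    """Replace a known prefix such as 'gbe:' with its full URL."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return value


def to_triples(record):
    """Expand the compact keywords of one record into triples."""
    subject = expand_iri(record["iri"])
    triples = []
    for key, value in record.items():
        if key == "iri":
            continue
        if key == "a":
            obj = CLASSES[value]        # class keyword, e.g. "entry"
        elif key == "crossref":
            obj = expand_iri(value)     # link to another OCC entity
        else:
            obj = value                 # plain literal
        triples.append((subject, TERMS[key], obj))
    return triples


for triple in to_triples(RECORD):
    print(triple)
```
        </preformat>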
        <p>
          Additional information about the OCC’s handling of citation data, and the
way they are represented in RDF, is detailed in the official OCC Metadata
Document [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
23 https://www.w3.org/TR/vocab-dcat/
24 https://www.w3.org/TR/void/
25 http://w3id.org/oc/sparql
26 https://www.w3.org/TR/prov-o/
27 https://www.w3.org/TR/prov-dc/
28 http://okfnlabs.org/bibjson/
29 https://w3id.org/oc/corpus/context.json
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 The ingestion workflow</title>
        <p>The ingestion of citation data into the OCC is handled by two Python
scripts, the Bibliographic Entries Extractor (BEE) and the SPAR Citation
Indexer (SPACIN), available in the OCC’s GitHub software repository<sup>30</sup>.</p>
      </sec>
      <sec id="sec-3-4">
        <title>BEE – The Bibliographic Entries Extractor</title>
        <p>As shown in Fig. 2, BEE is responsible for the creation of JSON files containing information about the
articles in the OA subset of PubMed Central (retrieved by using the Europe
PubMed Central API<sup>31</sup>). Each of these JSON files is created by asking
Europe PubMed Central for the metadata of all the articles it stores
for which the source XML file is available. Once these are identified, BEE processes all
the XML sources so as to extract the complete reference list of each paper
under consideration, and includes all the data in the final JSON file. An
excerpt of one of those JSON files is as follows:
{
"doi": "10.1007/s10544-016-0081-z",
"pmid": "27299468",
"pmcid": "PMC4908161",
"localid": "MED-27299468",
"curator": "BEE EuropeanPubMedCentralProcessor",
"source": "http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4908161/fullTextXML",
"source_provider": "Europe PubMed Central",
"references": [
...
{
"bibentry": "Svahn, HA, Berg, A. Single cells or large populations, Lab Chip, 2007, 7, 544, 546, DOI: 10.1039/b704632b, PMID: 17476370",
"pmid": "17476370",
"doi": "10.1039/b704632b",
"process_entry": "True"
}
]
}</p>
        <p>In particular, for each article retrieved by means of the Europe PubMed
Central API, BEE stores all the available bibliographic identifiers (in the
example, “doi”, “pmid”, “pmcid”, and “localid”) and all the textual
references, enriched by their own related bibliographic identifiers if those are
available. In addition, the JSON file also includes provenance information
about the source, its provider and the OCC curator (i.e. the particular
BEE Python class responsible for the extraction of these metadata from
the source). The created JSON files are then processed, independently, by
the tool presented in the next section.</p>
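        <p>To make the structure of these files concrete, the following Python sketch reads a trimmed copy of the excerpt above and lists the identifiers of the citing article and of each reference. It is an illustration under stated assumptions rather than BEE or SPACIN code: the function names are hypothetical, and we assume that “process_entry” is a string flag marking the entries to be handled downstream, which the text does not spell out.</p>
        <preformat>
```python
import json

# Trimmed copy of the excerpt shown in the text (the "source" field and
# the elided references are omitted here).
SAMPLE = """
{
  "doi": "10.1007/s10544-016-0081-z",
  "pmid": "27299468",
  "pmcid": "PMC4908161",
  "localid": "MED-27299468",
  "curator": "BEE EuropeanPubMedCentralProcessor",
  "source_provider": "Europe PubMed Central",
  "references": [
    {
      "bibentry": "Svahn, HA, Berg, A. Single cells or large populations, Lab Chip, 2007, 7, 544, 546, DOI: 10.1039/b704632b, PMID: 17476370",
      "pmid": "17476370",
      "doi": "10.1039/b704632b",
      "process_entry": "True"
    }
  ]
}
"""

CITING_ID_KEYS = ("doi", "pmid", "pmcid", "localid")
REFERENCE_ID_KEYS = ("doi", "pmid", "pmcid")


def citing_ids(record):
    """Identifiers stored for the citing article."""
    return {key: record[key] for key in CITING_ID_KEYS if key in record}


def references_to_process(record):
    """Yield (text, identifiers) for each reference flagged for processing."""
    for ref in record.get("references", []):
        # Assumption: the flag is the string "True", as in the excerpt.
        if ref.get("process_entry") == "True":
            ids = {key: ref[key] for key in REFERENCE_ID_KEYS if key in ref}
            yield ref["bibentry"], ids


record = json.loads(SAMPLE)
print(citing_ids(record))
for text, ids in references_to_process(record):
    print(ids, text[:40])
```
        </preformat>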
        <p>We have undertaken some tests to determine the performance of BEE
in generating these JSON files. In particular, we queried Europe PubMed
Central for the metadata of articles while running BEE for 30 minutes on
a MacBook Pro, with 2 GHz Intel Core i7 processor, 8 GB DDR3 1600
MHz, OS X 10.11.3. During that time, we were able to create 185 JSON
files containing all the aforementioned metadata, giving a rate of about 6
new JSON files per minute.
30 https://github.com/essepuntato/opencitations
31 https://europepmc.org/RestfulWebService</p>
      </sec>
      <sec id="sec-3-5">
        <title>SPACIN – The SPAR Citation Indexer</title>
        <p>SPACIN processes each JSON file created by BEE, retrieving additional metadata
about all the citing/cited articles described in it by querying the
Crossref API<sup>32</sup> and the ORCID API<sup>33</sup>. These APIs are also used to disambiguate
bibliographic resources and agents by means of the identifiers retrieved
(e.g., DOI, ISSN, ISBN, ORCID, URL, and Crossref member URL). Once
SPACIN has retrieved all these metadata, appropriate RDF resources are
created (or reused, if they have been already added in the past) and stored
in the file system in JSON-LD format (as shown in Section 3.2) and,
additionally, within the OCC triplestore. It is worth noting that, for space and
performance reasons, the triplestore includes all the data about the curated
entities, but does not store their provenance data nor the descriptions of
the datasets themselves – these are accessible only via HTTP, not via the
SPARQL endpoint.</p>
        <p>The SPACIN workflow, described in Fig. 2, is a process that runs until
no more JSON files are available from BEE. Thus, the current instance
of the OCC is evolving dynamically in time, and can be easily extended
beyond Europe PubMed Central by reconfiguring it to interact with
additional REST APIs from different sources, so as to gather new article
metadata and their related references.</p>
        <p>Each new resource recorded within the OCC by SPACIN occupies
between 0.3 and 4 KB, plus an additional 32 KB dedicated to storage of its
provenance data. Each day, the workflow adds about 2 million triples to
the corpus, describing more than 20,000 new citing/cited bibliographic
resources and about 100,000 new authors, 5% of whom are disambiguated
through their ORCID ids.
32 http://api.crossref.org/
33 http://members.orcid.org/api/</p>
        <p>We have tested the performance of SPACIN in processing the JSON
files generated by BEE and producing new RDF resources for the OCC. In
particular, we ran SPACIN on two sets of JSON files: 67 JSON files
describing all 67 papers included in the Proceedings of ISWC 2015, and
the first 67 JSON files produced by BEE from the Open Access subset of
PubMed Central as the outcome of the experiment described in Section 3.3.
We used the same configuration as before, i.e. a MacBook Pro, with 2 GHz
Intel Core i7 processor, 8 GB DDR3 1600 MHz, OS X 10.11.3.</p>
        <p>
          ISWC 2015 dataset. SPACIN took 45 minutes to process all 67
papers in the ISWC 2015 Proceedings, and the outcomes have been published
in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Each citing paper contained about 23 references on average, and
SPACIN produced 1,441 new citing/cited resources, for a total of 1,531
citation links. These resources are contained in 411 different container
resources (e.g. journals, proceedings, books), published by 42 distinct
publishers. The total number of authors is 3,076, 157 of whom (5.1%) have
been disambiguated through their ORCID identifiers. The total number of RDF
statements created is 69,995 (which, as explained, excludes provenance data and
dataset descriptions), on average 1,044 triples per citing resource.
        </p>
        <p>
          Europe PubMed Central dataset. SPACIN took 210 minutes to
process 67 papers from Europe PubMed Central, and the outcomes have
been published in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Each citing paper contained about 50 references on
average, and SPACIN produced 3,391 new citing/cited resources, for a
total of 3,337 citation links. These resources are contained in 1,047 different
container resources (e.g. journals, proceedings, books), published by 137
distinct publishers. The total number of authors is 21,658, 957 of whom
(4.4%) have been disambiguated through their ORCID identifiers. The
total number of RDF statements created is 377,237 (excluding, as before,
provenance data and dataset descriptions), on average 5,630 triples per
citing resource. This number will reduce as the OCC becomes more fully
populated, since more cited resources will already be described within the
database.
        </p>
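        <p>The averages quoted for the two runs can be re-derived from the raw counts with a few lines of Python. The sketch below is only an arithmetic check with hypothetical names; it assumes that the average number of references per paper can be approximated as citation links divided by citing papers, and that the triples-per-paper figure in the text is rounded down.</p>
        <preformat>
```python
# Raw counts as reported in the text for the two 67-paper runs.
DATASETS = {
    "ISWC 2015": {
        "papers": 67, "minutes": 45, "links": 1531,
        "authors": 3076, "with_orcid": 157, "triples": 69995,
    },
    "Europe PubMed Central": {
        "papers": 67, "minutes": 210, "links": 3337,
        "authors": 21658, "with_orcid": 957, "triples": 377237,
    },
}


def summarise(counts):
    """Derive the per-paper averages and the ORCID percentage."""
    papers = counts["papers"]
    return {
        # Assumption: references per paper ~ citation links / papers.
        "refs_per_paper": round(counts["links"] / papers, 1),
        "orcid_pct": round(100 * counts["with_orcid"] / counts["authors"], 1),
        # Integer division, matching the rounded-down figures in the text.
        "triples_per_paper": counts["triples"] // papers,
        "minutes_per_paper": round(counts["minutes"] / papers, 2),
    }


for name, counts in DATASETS.items():
    print(name, summarise(counts))
```
        </preformat>
        <p>Under these assumptions the check gives about 0.67 minutes per paper for the ISWC 2015 run and about 3.13 minutes per paper for the Europe PubMed Central run.</p>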
        <p>In Table 1 we summarise some metrics related to the resources included
in the aforementioned exemplar datasets. While these data are far from
providing full coverage, they offer interesting snapshots of these two
communities. On the one hand, the community of ISWC 2015 is composed of a
relatively small number of people. Even if the average number of references
per paper is quite small (average 23, the paper with the most references
having 47 citation links), there are several papers that were cited more than
once (the most cited one received 7 citations). Many of the citations are
to resources for which Crossref was not able to return any metadata. This
is understandable, since many citations in these Semantic Web papers are
to Web documents (e.g. W3C Recommendations) and to workshop papers
not indexed by Crossref (e.g. CEUR Workshop Series<sup>34</sup>). Some of these
non-Crossref-indexed publications are well known and well cited within
this community (the most cited one has 4 citations within these 67 ISWC
papers).
34 http://ceur-ws.org/</p>
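        <table-wrap id="tab-1">
          <label>Table 1</label>
          <caption>
            <p>Metrics related to the resources included in the two exemplar datasets.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Property</th><th>ISWC 2015</th><th>Europe PubMed Central</th></tr>
            </thead>
            <tbody>
              <tr><td>Max. number of bibliographic references within a paper</td><td>47</td><td>320</td></tr>
              <tr><td>Max. number of citations received by a paper within this sample</td><td>7</td><td>3</td></tr>
              <tr><td>Percentage of cited resources for which Crossref did not return any metadata</td><td>44%</td><td>13%</td></tr>
              <tr><td>Max. number of citations received by a cited resource for which Crossref did not return any metadata</td><td>4</td><td>1</td></tr>
            </tbody>
          </table>
        </table-wrap>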
        <p>On the other hand, we see quite different citation behaviour in the
Europe PubMed Central papers. In this case, as expected, the average
number of references is higher (average 50, with one review paper having 320
citation links). The paper within this small sample that has been cited most
received only 3 citations from the other 66 papers, and this is clearly due
to the size of the citation graph of the Biomedical and Life Science
community to which PubMed Central relates, which is bigger and
more sparsely linked than the ISWC one. Additionally, these papers usually
cite others published in journals to which proper identifiers (e.g. DOIs) have
been assigned, explaining the lower percentage of citations to resources that
are not indexed by Crossref.
</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Related works</title>
      <p>In recent years we have seen a growing interest within the Semantic Web
community in creating and making available RDF datasets concerning
bibliographic metadata of scholarly documents. While the list of such works
is quite extensive, in this section we describe four of the most important
contributions in the area.</p>
      <p>
        The Semantic Lancet<sup>35</sup> Project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] aims at building a Linked Open
Dataset of scholarly publication metadata starting from the articles
published by Elsevier. In particular, the current dataset contains SPAR-based
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] metadata about several papers published in the Journal of Web
Semantics<sup>36</sup>, including citation links marked with the motivations justifying them
by means of CiTO properties. It has several graphical interfaces that allow
browsing and sense-making of these data.
      </p>
      <p>
        Springer LOD<sup>37</sup> [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is an RDF dataset made available by Springer
Nature that publishes Springer metadata about conferences as Linked Open
Data (LOD). Its main focus is on proceedings volumes and the related
conferences, but it does not contain metadata describing the individual articles
contained in such proceedings.
35 http://semanticlancet.eu/
36 http://www.journals.elsevier.com/journal-of-web-semantics/
37 http://lod.springer.com/
      </p>
      <p>
        OpenAIRE<sup>38</sup> [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a Horizon 2020 open data project which
publishes metadata of more than 14 million publications and thousands
of datasets. It makes available a mechanism for searching, discovering and
monitoring scientific outputs.
      </p>
      <p>
        Finally, Scholarly Data<sup>39</sup> [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is a new project that refactors the Semantic
Web Dog Food<sup>40</sup> so as to keep the dataset growing in good health, and that
adopts the new Conference Ontology<sup>41</sup> (aligned with other existing models
including SPAR [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) for describing the data.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions</title>
      <p>In this paper we have introduced the OpenCitations Project, which has
created an open repository of accurate bibliographic references harvested
from the scholarly literature: the OpenCitations Corpus (OCC). The new
instance of the OCC has recently been established, and is already
populated with data describing 595,222 citation links (as of August 24, 2016) –
a number that will grow quickly over the coming months as the
continuous workflow adds new data dynamically from Europe PubMed Central and
other authoritative sources. The OCC SPARQL endpoint is presently
available for use, and distributions of the OCC datasets will shortly be made
openly available for bulk download – the first of these by early September
2016, with subsequent incremental additions.</p>
      <p>We are currently working on two different aspects. First of all, we are
developing tools for linking the resources within the OCC with those
included in other datasets, e.g. Scholarly Data and Springer LOD. In
addition, we are experimenting with the use of multiple parallel instantiations
of SPACIN, so as to increase the amount of new information that can be
processed daily into the OCC.</p>
      <p>
        Acknowledgements. All the scripts used in OpenCitations have been
developed as outcomes of several personal communications with people
responsible for the external services that OCC uses. We would like to thank
leading people in Europe PubMed Central (in particular Johanna McEntyre
and Vid Vartak), Crossref (in particular Ed Pentz, Geoffrey Bilder, and
Karl Ward) and ORCID (in particular Josh Brown and Laurel Hack) for
their help. We would also like to thank Alfred Hofmann and Aliaksandr
Birukou (Springer Nature) for allowing us to publish in Figshare the OCC
metadata concerning the Proceedings of ISWC 2015 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which they kindly
provided us in XML.
38 https://www.openaire.eu/
39 http://www.scholarlydata.org/
40 http://data.semanticweb.org/
41 https://w3id.org/scholarlydata/ontology/conference-ontology.owl
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alexiou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vahdati</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lange</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papastefanatos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lohmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>OpenAIRE LOD services: Scholarly Communication Data as Linked Data</article-title>
          . To appear in
          <source>Proceedings of SAVE-SD 2016</source>
          . http://cs.unibo.it/save-sd/2016/papers/html/alexiou-savesd2016.html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bagnacani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciancarini</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Di Iorio</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nuzzolese</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peroni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vitali</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>The Semantic Lancet Project: A Linked Open Dataset for Scholarly Publishing</article-title>
          .
          In
          <source>EKAW 2014 Satellite Events</source>
          :
          <fpage>101</fpage>
          -
          <lpage>105</lpage>
          . http://dx.doi.org/10.1007/978-3-319-17966-7_10
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bryl</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birukou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eckert</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kessler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>What's in the proceedings? Combining publisher's and researcher's perspectives</article-title>
          .
          In
          <source>Proceedings of SePublica 2014</source>
          . http://ceur-ws.org/Vol-1155/paper-01.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Di Iorio</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nuzzolese</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peroni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shotton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vitali</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Describing bibliographic references in RDF</article-title>
          .
          In
          <source>Proceedings of SePublica 2014</source>
          . http://ceur-ws.org/Vol-1155/paper-05.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Falco</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gangemi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peroni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vitali</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Modelling OWL ontologies with Grafoo</article-title>
          .
          In
          <source>The Semantic Web: ESWC 2014 Satellite Events</source>
          :
          <fpage>320</fpage>
          -
          <lpage>325</lpage>
          . http://dx.doi.org/10.1007/978-3-319-11955-7_42
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Nuzzolese</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gentile</surname>
            ,
            <given-names>A. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Presutti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gangemi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Conference Linked Data - Our Web Dog Food has gone gourmet</article-title>
          . To appear in
          <source>Proceedings of ISWC 2016</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Peroni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>The Semantic Publishing and Referencing Ontologies</article-title>
          . In
          <source>Semantic Web Technologies and Legal Scholarly Publishing</source>
          :
          <fpage>121</fpage>
          -
          <lpage>193</lpage>
          . http://dx.doi.org/10.1007/978-3-319-04777-5_5
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Peroni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dutton</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shotton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Setting our bibliographic references free: towards open citation data</article-title>
          .
          <source>Journal of Documentation</source>
          ,
          <volume>71</volume>
          (
          <issue>2</issue>
          ):
          <fpage>253</fpage>
          -
          <lpage>277</lpage>
          . http://dx.doi.org/10.1108/JD-12-2013-0166
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Peroni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shotton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>FaBiO and CiTO: ontologies for describing bibliographic resources and citations</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>17</volume>
          :
          <fpage>33</fpage>
          -
          <lpage>43</lpage>
          . http://dx.doi.org/10.1016/j.websem.2012.08.001
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Peroni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shotton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Exemplar OCC dataset from Europe PubMed Central metadata</article-title>
          . Figshare. https://dx.doi.org/10.6084/m9.figshare.3481922
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Peroni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shotton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Exemplar OCC dataset from Springer Nature metadata</article-title>
          . Figshare. https://dx.doi.org/10.6084/m9.figshare.3481949
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Peroni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shotton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Metadata for the OpenCitations Corpus</article-title>
          . Figshare. https://dx.doi.org/10.6084/m9.figshare.3443876
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Peroni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shotton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vitali</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Scholarly publishing and the Linked Data: describing roles, statuses, temporal and contextual extents</article-title>
          .
          In
          <source>Proceedings of i-Semantics 2012</source>
          :
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          . http://dx.doi.org/10.1145/2362499.2362502
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Shotton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Open Citations</article-title>
          .
          <source>Nature</source>
          ,
          <volume>502</volume>
          (
          <issue>7471</issue>
          ):
          <fpage>295</fpage>
          -
          <lpage>297</lpage>
          . http://dx.doi.org/10.1038/502295a
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>