<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Extending Nanopublications with Knowledge Provenance for Multi-Source Scientific Assertions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Giachelle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Marchesin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Menotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmaria Silvello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>11799</volume>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>Nanopublications are RDF graphs that enable sharing machine-readable assertions on the Web while tracking their provenance and publication information. However, the current nanopublication model focuses on the provenance of single-source assertions derived from a specific publication or database. This work proposes extending the nanopublication model with a fourth component called knowledge provenance. Knowledge provenance captures the case where an assertion is derived not from a single publication but from a body of knowledge that can comprise supporting and conflicting pieces of evidence that we need to track and refer to. We apply the defined model to the facts generated by the Collaborative Oriented Relation Extraction (CORE) system and publish 197,511 assertions as extended nanopublications, allowing the identification, representation, access, and citation of individual gene expression-cancer associations.</p>
      </abstract>
      <kwd-group>
        <kwd>Nanopublications</kwd>
        <kwd>Knowledge Provenance</kwd>
        <kwd>Data Provenance</kwd>
        <kwd>Knowledge Bases</kwd>
        <kwd>Gene-Cancer Associations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Given the high volume of publications, scientific evidence is often extracted automatically
and organized into Knowledge Bases (KBs), which are widely applicable because they are
understandable to humans and machines [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Making each statement accessible on its
own is key to creating a unified resource that contains all the relevant knowledge in a specific field.
This approach allows for easy retrieval of, access to, and reference to the specific data points we mention.
To accomplish this goal, the nanopublication model proves particularly effective as it enables
the identification, representation, access, and citation of individual assertions [3, 4]. This model
sees extensive use in representing statements, especially in the life science domain [5, 6, 7].
The nanopublication model consists of three named graphs containing, respectively,
the assertion, its provenance, and details about the nanopublication itself.
The model is well-suited for single assertions originating from a single
source of evidence: its provenance graph specifically addresses
what is termed "Information Provenance." However, challenges arise when information is derived
from the aggregation of multiple sources. In such cases, supporting and
conflicting evidence exists, and each assertion may be associated with a reliability
score computed in diverse ways. Hence, systems that generate assertions by aggregating
multiple sources of evidence need to keep track of the provenance of
each piece of information they use.
      </p>
      <p>This limitation of the nanopublication model is more evident when considering the assertions
of a large-scale knowledge discovery platform storing more than 230K gene expression-cancer
associations: CoreKB [8].<sup>1</sup> CoreKB stores facts generated by the CORE system, which
harvests the biomedical literature and extracts detailed aspects from various evidence sources to
produce scientific facts suitable for publication, each as a Gene Cancer Status (GCS). Creating a
GCS involves extracting information about a gene expression related to a specific disease from
numerous sentences in multiple articles. Thus, supporting and conflicting evidence is
aggregated to determine the most probable scientific assertion related to the pool of sentences [9].
Each fact generated by CORE can be published as a nanopublication following the standard
model [10]. However, the provenance graph of the current nanopublication model does not
provide enough information to link the GCS and the role of each evidence sentence from the
literature. In this work, we extend the nanopublication model to account for multi-source
assertions by introducing a novel component to its structure called knowledge provenance. The
knowledge provenance graph describes which pieces of information contributed to support or
confute the considered assertion. In addition, it includes information about the reliability of
each assertion based on each source of information and its extraction process. It is important
to note that in the proposed model, the components of the original nanopublication model
remain unchanged, allowing for backward compatibility. We model knowledge provenance as an
additional named graph of a nanopublication, defined according to an appropriate ontology. In
this regard, we define the PROV-K ontology,<sup>2</sup> a general resource to represent the provenance
information of assertions derived from multiple sources of evidence. The PROV-K ontology
extends the PROV Ontology (PROV-O) and is grounded in the literature defining
knowledge provenance [11, 12, 13]. To show the applicability of the proposed model, we serialize
all facts in CoreKB as extended nanopublications. We published 197,511 extended
nanopublications representing the facts in CoreKB, which can be browsed in the CoreKB platform and
downloaded individually from the same platform or in bulk from Zenodo [14]. We also release the
source code<sup>3</sup> for building the extended nanopublications, which can be used as a template for
future applications on different resources.</p>
      <p>The rest of this work is organized as follows. Section 2 introduces the original nanopublication
model and describes previous efforts in data, information, and knowledge provenance. Section 3
defines the extended nanopublication model accounting for knowledge provenance. Section 4
presents an in-use, large-scale knowledge discovery platform, highlighting the limitations
of the current nanopublication model when dealing with multi-source assertions, and
describes the serialization of extended nanopublications starting from the facts in CoreKB.
Section 5 draws some final remarks.</p>
      <p><sup>1</sup>https://gda.dei.unipd.it/</p>
      <p><sup>2</sup>Publicly available here: https://prov-k.dei.unipd.it/ontology/</p>
      <p><sup>3</sup>https://github.com/mntlra/knowledgeProvenance</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The nanopublication model aims to facilitate the integration, exchange, accessibility and
comprehension of scientific statements, and to enable citations at the granularity of individual claims [3, 4].
Following this framework, a scientific publication can be divided into single statements or
assertions, with each assertion encapsulated in a distinct nanopublication containing all pertinent
information about that specific claim. Using Semantic Web technologies, the nanopublication
model represents scientific claims in a distinctive, identifiable, citable, and reusable format.
Representing data as nanopublications enhances data-intensive science and allows for fact
discovery exploiting machine-readable information [15]. From a technical viewpoint, a
nanopublication is a named graph that comprises three basic components, each represented as a
named graph itself: (i) the assertion graph, containing the scientific assertion; (ii) the provenance
graph, containing information about where the assertion comes from and how it has been
defined; (iii) the publication info graph, containing all the metadata of the nanopublication,
such as who curated it and when it was created. The components of the nanopublication are
interconnected using a fourth graph called the head graph.</p>
      <p>The nanopublication model has been used to represent statements from different fields,
especially in the life science domain. Chichester et al. [16] created nanopublications from
scientific facts associated with more than 38K proteins stored in the neXtProt database.<sup>4</sup> This
approach showed that using the nanopublication model for the neXtProt database eases access to
its information and can be a useful tool for expanding biological research [5]. Queralt-Rosinach
et al. [7] published the contents of the DisGeNET database<sup>5</sup> as nanopublications to provide
a Linked Data resource. Waagmeester et al. [6] described their efforts in converting
WikiPathways, an online collaborative pathway resource, into nanopublications.<sup>6</sup> Overall,
there are more than 10M nanopublications publicly accessible worldwide [17]. Concerning
the aggregation of multiple nanopublications, Bucur et al. [18] proposed an approach where
nanopublications representing snippets of scientific articles related to the same publication
are interlinked using properties like refersTo. Although the unifying model proposed in [18] is
relevant to our study, it does not consider the reliability of an assertion or the supporting
or conflicting relationships between pieces of information. The concept of nanopublications
has already been expanded in [19], where the assertion graph has been extended to account for
English sentences representing textual scientific claims following a semantic scheme called AIDA
(Atomic, Independent, Declarative, Absolute). However, we are interested in machine-readable
representations, like those of the nanopublication model.</p>
      <p>In the era of truth discovery algorithms and automatic information extraction, the
nanopublication model fails to represent data reliability and the provenance of assertions constituted by an
ensemble of contrasting and supporting evidence. In this regard, Clark et al. [20] formalized the
micropublication model, which represents empirical evidence beyond statement-based models
like nanopublications. The proposed model offers a representation of biomedical evidence with
particular interest in building claim networks and their lineage. Although related, the work by
Clark et al. only targets the modeling of the biomedical communications ecosystem, including
reproducibility and verifiability in research, which is out of scope for this study. Besides, the
micropublication model represents the claim of a statement in textual form, as also happens
for AIDA nanopublications [19].</p>
      <p><sup>4</sup>https://www.nextprot.org/</p>
      <p><sup>5</sup>https://www.disgenet.org/rdf</p>
      <p><sup>6</sup>https://github.com/wikipathways/nanopublications</p>
      <sec id="sec-2-1">
        <p>The Data–Information–Knowledge–Wisdom (DIKW) pyramid is a widely recognized model
for representing information and knowledge within management systems [21]. It describes the
processes involved in the data transformation, from a piece of data to the wisdom embedded
in it. Each step adds value to the final result, starting from raw Data, from which one can extract
Information, up to Wisdom, which is the application of the Knowledge acquired from
Information. We establish a connection between the DIKW pyramid and provenance. In earlier studies,
the initial level, known as Data Provenance, has received extensive attention within databases.
Its primary emphasis lies in tracing the data lineage in response to a query [22, 23]. In this
context, Provenance encompasses the origin and the pathway through which a specific piece
of data was introduced into the given database. Over the years, various conceptualizations of
provenance have been proposed and explored, such as “why-provenance,” “where-provenance,”
and “how-provenance” [22, 23].</p>
        <p>
          The second level concerns Information Provenance, which represents the provenance of
assertions inferred from data. This is embedded in the provenance graph of the nanopublication
model and has been studied in the context of the Semantic Web. Provenance on the Semantic
Web comprises metadata representing the creation and publication of resources. The PROV
Ontology (PROV-O)<sup>7</sup> provides a formal language to encode provenance information in a
machine-readable format. It is based on the PROV Data Model (PROV-DM)<sup>8</sup> and the Open
Provenance Model (OPM) [
          <xref ref-type="bibr" rid="ref3">24</xref>
          ]. While extensive in scope, PROV-O models provenance as in
the provenance graph of the nanopublication model; therefore, it lacks the representation of
supporting and contradicting evidence and reliability scores.
        </p>
        <p>The third level, called Knowledge Provenance, is the focus of this work and has been studied
in several works by Fox and Huang [25, 12, 13, 26]. Knowledge Provenance (KP) has been
proposed as an approach to annotate the reliability of information extracted from web
sources based on who created the assertion, how much the creator can be trusted, and what
the information depends on. Little work has been done towards this end, and it mostly focuses
on providing a taxonomy of four levels of provenance based on the certainty degree of each
assertion [25]: Static KP (Level 1) for assertions whose truth value does not change
over time [25]; Dynamic KP (Level 2) allowing the validity of information to change over
time [12]; Uncertainty-oriented KP (Level 3) considering truth values and relationships that are
uncertain [13]; Judgment-based KP (Level 4) for provenance supported by social processes, e.g.,
truth propagation in social networks [26]. The Static Knowledge Provenance Ontology defines a
taxonomy of proposition types and a set of axioms allowing the development of a reasoner to
assess truth values based on different cases [11]. Although the ontology has been formalized
in [11], it is not available as a resource.</p>
        <p><sup>7</sup>http://www.w3.org/TR/2013/REC-prov-o-20130430/</p>
        <p><sup>8</sup>https://www.w3.org/TR/prov-dm/</p>
        <p>The fourth level, Wisdom Provenance, focuses on keeping track of the provenance of the
wisdom inferred from knowledge or applications exploiting such knowledge, which is still
unexplored in the literature.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Extended Nanopublication Model</title>
      <p>We extend the original nanopublication model by introducing a novel component named
knowledge provenance, which allows for the tracking of provenance for multi-sourced assertions.
In this context, “knowledge provenance” is the nanopublication component describing which
pieces of information contributed to support or confute the assertion. It can also include
information about the reliability of the assertion and its assigned certainty degree. It is important
to note that our approach does not change the other components of the original nanopublication
model.</p>
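      <p>As a rough illustration of this backward compatibility, the sketch below lists the head-graph links of a standard and of an extended nanopublication as plain Python tuples. It is a minimal sketch, not the authors' implementation: the function name head_graph and the tuple representation are assumptions made for illustration, while the predicates are those that appear in the TriG listing of Section 4.</p>
      <preformat>
```python
# Links from the nanopublication to its named graphs.
# The first three belong to the standard model; np:hasKnowledgeProv is the
# link used for the additional graph in the paper's TriG listing.
STANDARD_LINKS = {
    "assertion": "np:hasAssertion",
    "provenance": "np:hasProvenance",
    "publicationInfo": "np:hasPublicationInfo",
}
KNOWLEDGE_PROV_LINK = ("knowledgeProv", "np:hasKnowledgeProv")


def head_graph(extended: bool) -> list[tuple[str, str, str]]:
    """Return the head-graph triples of a (hypothetical) nanopublication."""
    triples = [("this:", "a", "np:Nanopublication")]
    for graph, predicate in STANDARD_LINKS.items():
        triples.append(("this:", predicate, f"sub:{graph}"))
    if extended:
        graph, predicate = KNOWLEDGE_PROV_LINK
        triples.append(("this:", predicate, f"sub:{graph}"))
    return triples


standard = head_graph(extended=False)
extended = head_graph(extended=True)

# Every triple of the standard head graph is kept in the extended one,
# so the original components remain reachable without changes.
assert all(triple in extended for triple in standard)
```
      </preformat>
      <p>Under these assumptions, the extended head graph differs from the standard one by a single additional triple pointing to the knowledge provenance graph.</p>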
      <p>Since each module of a nanopublication is a named graph, we model knowledge provenance
as a named graph according to an ontology called the PROV-K ontology.<sup>9</sup> We designed the
PROV-K ontology by extending PROV-O<sup>10</sup> to represent provenance information of assertions
derived from multiple sources using an aggregation algorithm. Note that the PROV-K ontology
is a standalone resource, meaning it can also be used independently of nanopublications to
represent provenance. We developed the PROV-K ontology following the guidelines provided
by the Static KP ontology [11] with the addition of some elements from Dynamic KP, such as
timestamps for truth values and trust relationships [12]. We also incorporate the concepts of
assigned certainty degree and certainty degree from Uncertainty-oriented KP [13]. Representing
truth values as probability distributions is more meaningful for our study than the Static KP
assumption, where the truth value is a categorical variable. We also expanded the concepts from
Uncertainty-oriented KP to represent reliability testing. In this way, we can classify assertions
into reliable or unreliable facts and include the conditions an assertion may fail to satisfy.
We describe the PROV-K ontology based on four main areas: Propositions, Digital Signature
and Information Sources, Trust Relationships, and Truth Value. Figure 1 reports the ontology
schema.</p>
      <p><sup>9</sup>The PROV-K ontology and its complete documentation are available at https://prov-k.dei.unipd.it/ontology/</p>
      <p>Proposition. The central unit of the PROV-K ontology is the “Proposition”, which is defined
as “the smallest piece of information to which provenance-related attributes may be ascribed” and
as “a declarative sentence that is either true or false” [11]. In our context, the nanopublication
model’s assertion graph can be considered the proposition. Since the PROV-K ontology extends
PROV-O, we model propositions as subclasses of prov:Entity. We also define a taxonomy of
propositions based on [11], which differentiates between independent and dependent
propositions, the latter being assertions whose truth value depends upon other propositions. Each proposition can
be supported by or conflict with other knowledge sources. To encompass this situation, we
defined two object properties called supportedBy and conflictingWith, both with range
class “Sentence” from the Semanticscience Integrated Ontology (SIO).<sup>11</sup> Each proposition can
be linked to one or more knowledge fields with the object property dc:subject from the
Dublin Core (DC) Metadata Items.<sup>12</sup></p>
      <p>Information Source and Signature. To determine the provenance of a proposition, it may
be useful to represent the document in which the considered proposition appears. We link the
proposition to the “document” it belongs to with the object property prov:wasDerivedFrom
from PROV-O. In this way, we allow a proposition to belong to any entity, e.g., a textual
document or a dataset. For any document and any proposition, one can define its creator with
the object property dc:creator from the DC Metadata Items, with range class prov:Agent
from PROV-O. In this way, the creator of a proposition or a document may be any agent, e.g., a
person or a digital artifact. We also represent the digital signature and signature status that can
be assigned to a proposition. We apply the PROV-K ontology to model knowledge provenance in
the context of nanopublications. However, the PROV-K ontology is a general resource designed
to track provenance information for aggregated sources of evidence beyond the knowledge
provenance graph of the nanopublication model. Thus, we defined the “Signature” class within
the ontology rather than solely relying on the modeling provided by the nanopublication model.
Nevertheless, when we apply such an ontology to the nanopublication model, we can represent
digital signatures with the “Nanopub Signature Element”<sup>13</sup> class from nanopubx.</p>
      <p><sup>10</sup>https://www.w3.org/TR/prov-o/</p>
      <p><sup>11</sup>http://semanticscience.org/resource/SIO_000113</p>
      <p><sup>12</sup>http://purl.org/dc/terms/subject</p>
      <p>[Figure 2: PROV-K ontology schema for the assigned truth value. Classes shown: PROV-K:AssignedTruthValue, PROV-K:ReliableFact, PROV-K:UnreliableFact, PROV-K:ReliabilityCondition, PROV-K:InsufficientEvidence, PROV-K:ContrastingEvidence, PROV-K:SufficiencyCondition, PROV-K:ConsistencyCondition, PROV-K:SufficiencyCriteria, and PROV-K:ConsistencyCriteria, related via rdfs:subClassOf. Properties shown: PROV-K:unmetCondition, PROV-K:unreliabilityReason (xsd:string), PROV-K:conditionScore (xsd:float), PROV-K:conditionThreshold (xsd:float), PROV-K:sufficiencyCriteria, and PROV-K:consistencyCriteria.]</p>
      <p>Trust Relationships. We identify two trust relationships: one between two agents, which
are referred to as provenance requester and information creator (class “InfoCreatorTrust”), and
another between an agent and a proposition (class “PropositionTrust”). The former is defined as
“the provenance requester a “trusts” information creator c in a specific knowledge field f”, where
“trust” means “a believes any proposition created by c in field f to be true”. The latter takes the
form of “Proposition x is trusted by an agent a” [11]. We also include the timestamp reporting
when the trust relationship was issued using the data property prov:atTime from PROV-O.
This work focuses on tracking the provenance of each piece of evidence and determining the
reliability of a given fact. Nevertheless, the PROV-K ontology can be easily expanded to account
for more complex trust relationships and decision processes.</p>
      <p>Truth Values. Fox and Huang defined two types of truth values: the assigned truth value and
the trusted truth value [11]. The former refers to the truth value assigned to the proposition by
its creator, while the latter identifies the truth value evaluated by an external agent
called “provenance requester”. In Static KP, the truth value of a proposition can be “True”, “False”,
or “Unknown”. Thus, we link each proposition to the class TruthValue with the object property
hasTruthValue, where one can store both the assigned and trusted truth values defined in
[11]. We also represent the timestamp reporting when the truth value was assigned or trusted
with the data property prov:atTime from PROV-O. We define the assigned certainty degree as
the probability that the proposition’s creator assigns the truth value of “True” to the proposition
and the certainty degree as the probability that an agent evaluates the trusted truth value as
“True” [13]. Based on the assigned certainty degree, we may classify propositions as reliable
or unreliable. For this reason, we expand the concept of “assigned truth value” to account for
reliability testing. We report the ontology schema for the assigned truth value in Figure 2.
We classify the assigned truth value as “Reliable Fact” or “Unreliable Fact”, where the latter
identifies propositions failing some pre-defined reliability tests. One can represent whether the
proposition is unreliable due to insufficient evidence (subclass InsufficientEvidence) or
due to contrasting evidence (subclass ContrastingEvidence). We also include a data property
called unreliabilityReason to describe why the proposition is deemed “unreliable”, as
well as an object property called unmetCondition to specify the reliability conditions the
proposition fails to meet. Each condition has a score, a threshold, and a criterion that
describes how the reliability condition works. For instance, in CORE we have two sufficiency
criteria and one consistency criterion, which are modeled as named individuals of type skos:Concept
and either SufficiencyCriteria or ConsistencyCriteria.</p>
      <p><sup>13</sup>http://purl.org/nanopub/x/NanopubSignatureElement</p>
    </sec>
    <sec id="sec-4">
      <title>4. Real-World Use Case</title>
      <p>CoreKB. CORE is a Knowledge Base Construction (KBC) system based on the combination
of ML-based models and domain-expert feedback [9]. CORE harvests text from the literature,
identifies sentences containing pairs of relevant entities, and extracts fine-grained aspects from
them to generate gene expression-cancer associations that can be published as facts – i.e., GCS.
For each fact, CORE combines three aspect probabilities to assign the gene class likelihood
to the three mutually exclusive gene classes: oncogene, tumor suppressor gene, and
biomarker. Then, the system performs a two-stage reliability test that, for each fact, first
verifies that the fact has sufficient evidence and subsequently checks that mutually exclusive
classes are not similarly probable, i.e., it assesses the degree of contradicting evidence. In this
way, unreliable facts can be fed back to domain experts for manual annotation in an active
learning paradigm, making CORE suitable for iterative KB versioning. For technical details and
the evaluation of the CORE system, we refer the interested reader to [9].</p>
      <p>The data extracted by CORE, and then ingested by CoreKB, contain information about
23,879 genes and 11,530 diseases for a total of more than 230K fine-grained facts supported by
1,037,845 sentences from 251,038 research articles. Figure 3 shows a GCS card displayed by the
CoreKB platform. Each GCS comprises information about the gene and disease involved, which
are identified by National Center for Biotechnology Information (NCBI) Gene IDs and Unified
Medical Language System (UMLS) Concept Unique Identifiers (CUIs), respectively, together with
the assigned gene class. In addition, each GCS is linked to the sentences supporting the fact, i.e.,
identifying the same gene class, and those conflicting with it. For each sentence, provenance
information includes the PubMed ID of the article from which the sentence has been extracted
and the year of publication of such an article. CoreKB comprises three types of GCS: reliable
facts, unreliable facts due to insufficient evidence, and unreliable facts due to low consensus
(contrasting evidence). The first are facts that passed the reliability tests performed by CORE,
while the others are facts that failed either of the two checks performed by the testing component.</p>
      <p>Facts generated by CORE can be published as nanopublications since each GCS can be
viewed as an assertion graph on its own. Giachelle et al. [10] showed the serialization of a
GCS in CoreKB following the standard nanopublication model. In summary, the assertion
graph comprises the GCS itself in the form of a Resource Description Framework (RDF) graph,
the publication info graph includes metadata about the nanopublication, and the provenance
graph describes how the assertion was derived. However, by publishing facts within CoreKB as
classic nanopublications, we cannot represent supporting and conflicting sentences or embed
information about their reliability. For instance, for the GCS in Figure 3, the standard
nanopublication model fails to represent that the assertion is derived from the ensemble of 83
supporting sentences and 41 conflicting sentences. Moreover, it cannot include information
about the reliability of such an assertion, i.e., it cannot convey that the GCS is
reliable as the probability that gene “AKT1” is an oncogene for “bladder neoplasm” is 0.82.</p>
      <p>Creation of the extended nanopublications. The extended nanopublication model has
been applied to serialize all the facts in CoreKB as nanopublications with knowledge
provenance. For the original nanopublication components, we followed the same methodology used
for the DisGeNET nanopublications [7]. Figure 4 shows a GCS modeled with the extended
nanopublication model and serialized in TriG format.<sup>14</sup></p>
      <p>A nanopublication representing a fact generated by CORE comprises five named graphs:
head, assertion, provenance, publication information, and knowledge provenance. The head graph
connects all the components by linking the nanopublication URI to its subgraphs. The assertion
graph contains the GCS in RDF format. The provenance graph includes the information
provenance and evidence used to build the GCS. In our case, all facts are derived from CoreKB and
are generated automatically (class “Automatic Assertion”).<sup>15</sup> Since the facts generated by CORE
often integrate more than one source of evidence (i.e., sentences), the source evidence is an
instance of class “Combinatorial Evidence” from the Evidence and Conclusion Ontology (ECO).<sup>16</sup>
The publication information graph includes the general topic of the nanopublications,
information about the authors of the nanopublications, and the used dataset. Since CoreKB comprises
fine-grained gene expression-cancer associations, the general topic for all nanopublications
is “gene-disease association linked with altered gene expression” from SIO.<sup>17</sup> The knowledge
provenance graph includes all supporting and conflicting sentences and information about the
reliability of the GCS. For instance, the GCS in Figure 4 is a reliable fact. Hence, its truth value
is an instance of class “ReliableFact” and we report its assigned certainty degree and support.</p>
      <p>Figure 4: A GCS modeled with the extended nanopublication model and serialized in TriG format (excerpt).</p>
      <preformat>
@prefix cegcs: &lt;http://gda.dei.unipd.it/cecore/resource/GCS#&gt; .
@prefix cesent: &lt;http://gda.dei.unipd.it/cecore/resource/Sentence#&gt; .
@prefix corekp: &lt;http://gda.dei.unipd.it/cecore/resource/nanopub/PROV-K/&gt; .
…
@prefix PROV-K: &lt;https://w3id.org/PROV-K/ontology/schema/&gt; .

sub:head {
  this: a np:Nanopublication ;
    np:hasAssertion sub:assertion ;
    np:hasProvenance sub:provenance ;
    np:hasPublicationInfo sub:publicationInfo ;
    np:hasKnowledgeProv sub:knowledgeProv .
}
sub:assertion { … }
sub:provenance { … }
sub:publicationInfo { … }
sub:knowledgeProv {
  sub:assertion PROV-K:hasTruthValue corekp:3262e08b519b5e61b244c7e42c001d98 ;
    PROV-K:conflictingWith cesent:0aab5b916e37fed4e9a2a48104af848d,
      …
      cesent:f3dfdef16032046ce21e3bb7eebfabe9 ;
    PROV-K:supportedBy cesent:008a43c727cb5c31a071c3161e1b246f,
      …
      cesent:fed4a67bd6d534926a7f5c152e3aa827 .
  corekp:3262e08b519b5e61b244c7e42c001d98 a PROV-K:ReliableFact ;
    PROV-K:assignedCertaintyDegree "0.81542736"^^xsd:float ;
    PROV-K:assignedCertaintyDegreeSupport 83 .
}
      </preformat>
      <p><sup>14</sup>Access at: https://gda.dei.unipd.it/cecore/resource/nanopub/3262e08b519b5e61b244c7e42c001d98/.</p>
      <p><sup>15</sup>http://purl.obolibrary.org/obo/ECO_0000203</p>
      <p><sup>16</sup>http://purl.obolibrary.org/obo/ECO_0000212</p>
      <p><sup>17</sup>http://semanticscience.org/resource/SIO_001123</p>
      <p>To represent the assertion graph, we rely on the ontology underlying the KBs generated by
CORE, 18 while for the provenance graph we rely on PROV-O. 19 For authorship and
versioning, we employ the Provenance, Authoring, and Versioning (PAV) vocabulary [27], and
for the description of the datasets used we employ the Provenance Vocabulary Core Ontology
Specification (PRV) [28]. The evidence annotation is described using the Weighted Evidence (WI)
vocabulary, 20 which comprises the object property wi:evidence to link the assertion to its
evidence, and the ECO ontology. 21 For the description of the topic of the nanopublications
and the process used to build the assertion, we use the SIO ontology. 22 For the knowledge
provenance graph, we extended the PROV-K ontology to include the reliability conditions
defined by the CORE system. Specifically, we added two sufficiency criteria and one consistency
criterion. A fact generated by CORE passes the sufficiency checks if the probabilities of Change
of Cancer Status (CCS) and Gene-Cancer Interaction (GCI) being not informative are both below a
threshold set to 0.7. The consistency check instead verifies whether the difference between
the probabilities of the two most likely gene classes for the fact exceeds a threshold
set to 0.4.</p>
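      <p>The reliability conditions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CORE implementation: the function name, the returned labels, the gene class names in the usage example, and the choice to run the sufficiency checks before the consistency check are all hypothetical. Only the two thresholds (0.7 and 0.4) come from the text.</p>

```python
SUFFICIENCY_THRESHOLD = 0.7  # from the text
CONSISTENCY_THRESHOLD = 0.4  # from the text


def classify_fact(p_ccs_not_informative, p_gci_not_informative, class_probs):
    """Return a reliability label for one fact.

    class_probs maps each gene class to its probability; at least two
    classes are needed for the consistency check.
    """
    # Sufficiency: both "not informative" probabilities must stay below 0.7.
    sufficient = (
        SUFFICIENCY_THRESHOLD > p_ccs_not_informative
        and SUFFICIENCY_THRESHOLD > p_gci_not_informative
    )
    if not sufficient:
        return "unreliable: insufficient evidence"

    # Consistency: the gap between the two most likely classes must exceed 0.4.
    top, runner_up = sorted(class_probs.values(), reverse=True)[:2]
    if top - runner_up > CONSISTENCY_THRESHOLD:
        return "reliable"
    return "unreliable: low consensus"


# Hypothetical class names and probabilities, for illustration only.
label = classify_fact(0.2, 0.3, {"class_1": 0.8, "class_2": 0.15, "class_3": 0.05})
```

      <p>In this sketch a fact with a clear winning class and informative CCS and GCI aspects is labeled reliable, while a fact whose two best classes are close is labeled as having low consensus.</p>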
      <p>To build the extended nanopublications, we extended the Python package nanopub. 23
We kept the provenance, publication information, and assertion graphs unchanged to preserve
backward compatibility with the original nanopublication model. In addition, we developed
a Python package to publish the facts in CoreKB as extended nanopublications serialized in
TriG syntax. The code takes as input either two CSV files, comprising the facts and the sentences
supporting or conflicting with them, or a Turtle (.ttl) file comprising the CoreKB
dump available on Zenodo [29]. The code for serializing the facts in CoreKB as extended
nanopublications can also serve as a template for future applications on different resources.</p>
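      <p>To illustrate the two-CSV input mode, the sketch below joins a facts file with a sentences file so that each fact carries its supporting and conflicting sentences, which is the information the knowledge provenance graph needs. The column names, the stance labels, and the function name are assumptions made for this example; the actual CSV layout used by the package is not specified here. The sketch stops before the TriG serialization step.</p>

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names and values; the real CSV headers are not given in the text.
FACTS_CSV = """fact_id,gene,disease,gene_class
f1,GENE_A,DISEASE_X,class_1
f2,GENE_B,DISEASE_Y,class_2
"""
SENTENCES_CSV = """fact_id,sentence_id,stance
f1,s1,supporting
f1,s2,conflicting
f1,s3,supporting
f2,s4,supporting
"""


def group_evidence(facts_file, sentences_file):
    """Attach supporting and conflicting sentence ids to each fact."""
    evidence = defaultdict(lambda: {"supporting": [], "conflicting": []})
    for row in csv.DictReader(sentences_file):
        # An unexpected stance label raises KeyError, which surfaces bad input early.
        evidence[row["fact_id"]][row["stance"]].append(row["sentence_id"])

    records = []
    for fact in csv.DictReader(facts_file):
        sides = evidence[fact["fact_id"]]
        records.append(
            {
                **fact,
                "supporting": list(sides["supporting"]),
                "conflicting": list(sides["conflicting"]),
            }
        )
    return records


records = group_evidence(io.StringIO(FACTS_CSV), io.StringIO(SENTENCES_CSV))
```

      <p>Each resulting record could then be handed to a serializer that emits the assertion, provenance, publication information, and knowledge provenance graphs for that fact.</p>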
      <p>CoreKB comprises 231,099 GCS, which can be divided into reliable facts, facts that are
unreliable due to insufficient evidence, and facts that are unreliable due to low consensus. We
filter out the facts that are unreliable due to insufficient evidence, as publishing them as
independent publications provides little to no information. As a result, we published 197,511
facts from CoreKB as extended nanopublications, accounting for 156,172 reliable facts and 41,339
facts that are unreliable due to contrasting evidence. Table 1 shows the gene class distribution
of the facts in CoreKB serialized as extended nanopublications. The extended nanopublications
are also available on Zenodo [14].</p>
      <p>18 http://gda.dei.unipd.it/cecore/ontology/
19 http://www.w3.org/TR/prov-o/
20 http://www.evidenceontology.org/
21 https://ontobee.org/ontology/ECO
22 https://ontobee.org/ontology/SIO
23 https://github.com/fair-workflows/nanopub</p>
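      <p>The reported counts can be checked with simple arithmetic, and the filtering step can be sketched on toy data. The three totals below come from the text; the number of filtered-out facts is derived by subtraction and is not stated in the paper. The status labels and the function name are hypothetical.</p>

```python
from collections import Counter

# Counts reported in the text.
TOTAL_GCS = 231_099
RELIABLE = 156_172
UNRELIABLE_LOW_CONSENSUS = 41_339

published = RELIABLE + UNRELIABLE_LOW_CONSENSUS
# Derived value: facts dropped as unreliable due to insufficient evidence.
filtered_out = TOTAL_GCS - published


def select_publishable(facts):
    """Keep every fact except those lacking sufficient evidence."""
    return [f for f in facts if f["status"] != "insufficient_evidence"]


# Toy records with hypothetical status labels.
toy_facts = [
    {"id": "f1", "status": "reliable"},
    {"id": "f2", "status": "insufficient_evidence"},
    {"id": "f3", "status": "low_consensus"},
]
kept = select_publishable(toy_facts)
status_counts = Counter(f["status"] for f in kept)
```

      <p>Under these figures, the published set is the sum of the reliable and low-consensus facts, and the remainder of CoreKB corresponds to the facts withheld for insufficient evidence.</p>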
      <p>We include the serialized nanopublications in the CoreKB platform to ease fact visualization.
For each GCS, one can explore the serialized nanopublication by clicking on the eye icon placed
in the drop-down list on the right side of the claim (see point B in Figure 3). 24 The visualization
depicts each component with a different color and displays URIs that resolve to a working
web page describing the considered element. One can also download the
extended nanopublication representing a specific GCS using the download button (see point
A in Figure 3).</p>
    </sec>
    <sec id="sec-5">
      <title>5. Final Remarks</title>
      <p>This work extends the current nanopublication model with a novel component called
knowledge provenance, which accounts for the provenance information of assertions derived from
the aggregation of multiple sources of evidence. We described knowledge provenance as a named
graph tracking the provenance of each piece of information that contributes to supporting or
refuting the assertion. To support the semantics of the knowledge provenance graph, we designed
the PROV-K ontology, which integrates PROV-O to represent provenance information of
assertions derived from the aggregation of multiple sources. The PROV-K ontology is a general
resource designed to track provenance information for aggregated sources of evidence beyond
the context of nanopublications. We applied the proposed model by serializing more than 197K
facts in CoreKB and publishing them as extended nanopublications. Such nanopublications can
be browsed and downloaded through the CoreKB platform. The serialization of facts
in CoreKB can also serve as a template for applying the extended nanopublication model to
different resources.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This project has received funding from the HEREDITARY Project, as part of the European
Union’s Horizon Europe research and innovation programme under grant agreement No GA
101137074.</p>
      <p>August 2005, Copenhagen, Denmark, IEEE Computer Society, 2005, pp. 524–528. URL:
https://doi.org/10.1109/DEXA.2005.193.</p>
      <p>[27] P. Ciccarese, S. Soiland-Reyes, K. Belhajjame, A. J. G. Gray, C. A. Goble, T. Clark, PAV
ontology: provenance, authoring and versioning, J. Biomed. Semant. 4 (2013) 37. URL:
https://doi.org/10.1186/2041-1480-4-37.</p>
      <p>[28] O. Hartig, J. Zhao, Publishing and consuming provenance metadata on the web of linked
data, in: Proc. of the Provenance and Annotation of Data and Processes - Third International
Provenance and Annotation Workshop, IPAW 2010, Troy, NY, USA, June 15-16, 2010.
Revised Selected Papers, volume 6378 of Lecture Notes in Computer Science, Springer, 2010,
pp. 78–90. URL: https://doi.org/10.1007/978-3-642-17819-1_10.</p>
      <p>[29] S. Marchesin, L. Menotti, G. Silvello, O. Alonso, CORE: Gene Expression-Cancer Knowledge
Base, Zenodo, 2023. URL: https://doi.org/10.5281/zenodo.7577127.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <article-title>Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases</article-title>
          ,
          <source>Found. Trends Databases</source>
          <volume>10</volume>
          (
          <year>2021</year>
          )
          <fpage>108</fpage>
          -
          <lpage>490</lpage>
          . URL: https://doi.org/10.1561/1900000064.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <article-title>Generations of knowledge graphs: The crazy ideas and the business impact</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>16</volume>
          (
          <year>2023</year>
          )
          <fpage>4130</fpage>
          -
          <lpage>4137</lpage>
          . URL: https://doi.org/10.14778/3611540.3611636.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>24 The serialized nanopublication representing the GCS used as an example throughout the paper can be visualized</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>