1. Introduction

PLoS Comput. Biol. 12 (2016). URL: https://doi.org/10. 1371/journal.pcbi.1004989. [7] N. Queralt

Extending Nanopublications with Knowledge Provenance for Multi-Source Scientific Assertions

Fabio Giachelle

Stefano Marchesin

Laura Menotti

Gianmaria Silvello

0 0 Department of Information Engineering, University of Padua , Padua , Italy

2025

11799 20 21

Nanopublications are RDF graphs that enable the possibility of sharing machine-readable assertions on the Web while tracking their provenance and publication information. However, the current nanopublication model focuses on the provenance of single-source assertions derived from a specific publication or database. This work proposes extending the nanopublication model to include a fourth component called knowledge provenance. Knowledge provenance captures the context where an assertion is not derived from a single publication but from a body of knowledge that can comprehend supporting and conflicting pieces of evidence that we need to track and refer to. We apply the defined model to the facts generated by the Collaborative Oriented Relation Extraction (CORE) and published 197, 511 assertions in the form of extended nanopublications, allowing the identification, representation, access, and citation of individual gene expression-cancer associations.

eol>Nanopublications Knowledge Provenance Data Provenance Knowledge Bases Gene-Cancer Associations

1. Introduction

Given the high volume of publications, scientific evidence is often extracted automatically and organized into Knowledge Bases (KBs), which are widely applicable because they are understandable to humans and machines [ 1, 2 ]. Making sure each statement is accessible on its own is key to create a unified resource that contains all the relevant knowledge in a specific field. This approach allows for easy retrieval, access, and reference to specific data points we mention. To accomplish this goal, the nanopublication model proves particularly efective as it enables the identification, representation, access, and citation of individual assertions [ 3, 4]. This model sees extensive use in representing statements, especially in the life science domain [5, 6, 7]. The structure of the nanopublication model consists of three named graphs, each containing information about the assertion, its provenance, and details about the nanopublication itself. While the nanopublication model is well-suited for single assertions originating from a single source of evidence, the provenance graph in the nanopublication model specifically addresses what is termed "Information Provenance." However, when dealing with information derived from the aggregation of multiple sources, challenges arise. In such cases, supporting and conflicting evidence exists, and each assertion may potentially be associated with a reliability score computed in diverse ways. Hence, there is a need to keep track of the provenance of each piece of information used to generate assertions by systems exploiting an aggregation of multiple sources of evidence.

This limitation of the nanopublication model is more evident when considering the assertions of a large-scale knowledge discovery platform storing more than 230K gene expression-cancer associations: CoreKB [8]. 1 CoreKB stores facts generated by the CORE system, which undertakes biomedical literature and extracts detailed aspects from various evidence sources to produce scientific facts suitable for publication each as a Gene Cancer Status ( GCS). Creating a GCS involves extracting information about a gene expression related to a specific disease from numerous sentences in multiple articles. Thus, supporting and conflicting evidence is aggregated to determine the most probable scientific assertion related to the pool of sentences [ 9]. Each fact generated by CORE can be published as a nanopublication following the standard model [10]. However, the provenance graph of the current nanopublication model does not provide enough information to link the GCS and the role of each evidence sentence from the literature. In this work, we extend the nanopublication model to account for multi-source assertions by introducing a novel component to its structure called knowledge provenance. The knowledge provenance graph describes which pieces of information contributed to support or confute the considered assertion. In addition, it includes information about the reliability of each assertion based on each source of information and its extraction process. It is important to note that in the proposed model, the components of the original nanopublication model remain unchanged, allowing for backward compatibility. We model knowledge provenance as an additional named graph of a nanopublication, defined according to an appropriate ontology. In this regard, we define the PROV-K ontology 2, a general resource to represent the provenance information of assertions derived from multiple sources of evidence. The PROV-K ontology is an integration of the PROV Ontology (PROV-O) and is grounded in the literature defining knowledge provenance [11, 12, 13]. To show the applicability of the proposed model, we serialize all facts in CoreKB as extended nanopublications. We published 197, 511 extended nanopublications representing the facts in CoreKB, which can be browsed in the CoreKB platform and downloaded separately from the same platform or in bulk in Zenodo [14]. We also release the source code 3 for building the extended nanopublications, which can be used as a template for future applications on diferent resources.

The rest of this work is organized as follows. Section 2 introduces the original nanopublication model and describes previous eforts in data, information, and knowledge provenance. Section 4 presents an in-use and large-scale knowledge discovery platform highlighting the limitations of the current nanopublication model when dealing with multi-source assertions. Section 3 defines the extended nanopublication model accounting for knowledge provenance. Section

1https://gda.dei.unipd.it/

2Publicly available here: https://prov-k.dei.unipd.it/ontology/ 3https://github.com/mntlra/knowledgeProvenance 4 describes the serialization of extended nanopublications starting from the facts in CoreKB. Section 5 draws some final remarks.

2. Related Work

The nanopublication model aims to facilitate the integration, exchange, accessibility and comprehension of scientific statements, and enable citations at the granularity of individual claims [ 3, 4]. Following this framework, a scientific publication can be divided into single statements or assertions, with each assertion encapsulated in a distinct nanopublication containing all pertinent information about that specific claim. Using Semantic Web technologies, the nanopublication model represents scientific claims in a distinctive, identifiable, citable, and reusable format. Representing data as nanopublications enhances data-intensive science and allows for fact discovery exploiting machine-readable information [15]. From a technical viewpoint, a nanopublication is a named graph that comprises three basic components; each represented as a named graph itself: (i) the assertion graph, containing the scientific assertion; (ii) the provenance graph, containing information about where the assertion comes from and how it has been defined; (iii) the publication info graph, containing all the metadata of the nanopublication, such as who curated it and when it was created. The components of the nanopublication are interconnected using a fourth graph called the head graph.

The nanopublication model has been used to represent statements from diferent fields, especially in the life science domain. Chichester et al. [16] created nanopublications from scientific facts associated with more than 38K proteins stored in the neXtProt database. 4 This approach showed that using the nanopublication model for the neXtProt database eases access to its information and can be a useful tool for expanding biological research [5]. Queralt-Rosinach et al. [7] published the contents of the DisGeNET database 5 as nanopublications to provide a Linked Data resource. Waagmeester et al., in [6], described their endeavors in converting WikiPathways, an online collaborative pathway resource, into nanopublications. 6 Overall, there are more than 10M nanopublications publicly accessible worldwide [17]. Concerning the aggregation of multiple nanopublications, Bucur et al. [18] proposed an approach where nanopublications representing snippets of scientific articles related to the same publication are interlinked, utilizing properties like refersTo. Albeit the unifying model proposed in [18] is relevant to our study, it still does not consider the reliability of an assertion and the supporting or conflicting relationships between pieces of information. The concept of nanopublications has already been expanded in [19]. Here, the assertion graph has been extended to account for English sentences representing textual scientific claims following a semantic scheme called AIDA (Atomic, Independent, Declarative, Absolute). However, we are interested in machine-readable representations, like the nanopublication model.

In the era of truth discovery algorithms and automatic information extraction, the nanopublication model fails to represent data reliability and the provenance of assertions constituted by an ensemble of contrasting and supporting evidence. In this regard, Clark et al. [20] formalized the

4https://www.nextprot.org/

5https://www.disgenet.org/rdf 6https://github.com/wikipathways/nanopublications micropublication model, which represents empirical evidence beyond statement-based models like nanopublications. The proposed model ofers a representation of biomedical evidence with particular interest in building claim networks and their lineage. Although related, the work by Clark et al. only targets the modeling of the biomedical communications ecosystem, including reproducibility and verifiability in research – which is out of scope for this study. Besides, the micropublication model represents the claim of a statement in textual form, as it happens also for AIDA nanopublications [19].

The Data–Information–Knowledge–Wisdom (DIKW) pyramid is a widely recognized model for representing information and knowledge within management systems [21]. It describes the processes involved in the data transformation, from a piece of data to the wisdom embedded in it. Each step adds value to the final results, starting from raw Data, where one can extract Information, to Wisdom, that is the application of Knowledge acquired from the information block. We establish a connection between the DIKW pyramid and provenance. In earlier studies, the initial level, known as Data Provenance, has received extensive attention within databases. Its primary emphasis lies in tracing the data lineage in response to a query [22, 23]. In this context, Provenance encompasses the origin and the pathway through which a specific piece of data was introduced into the given database. Over the years, various conceptualizations of provenance have been proposed and explored, such as “why-provenance,” “where-provenance,” and “how-provenance” [22, 23].

The second level concerns Information Provenance, which represents the provenance of assertions inferred from data. This is embedded in the provenance graph of the nanopublication model and has been studied in the context of the Semantic Web. Provenance on the Semantic Web comprises metadata representing the creation and publication of resources. The PROV Ontology (PROV-O) 7 provides a formal language to encode provenance information in a machine-readable format. It is based on the PROV Data Model (PROV-DM) 8 and the Open Provenance Model (OPM) [ 24 ]. While extensive in scope, the PROV-O models provenance as in the provenance graph of the nanopublication model; therefore, it lacks the representation of supporting and contradicting evidence and reliability scores.

The third level, called Knowledge Provenance, is the focus of this work, and it has been studied in diferent works by Fox and Huang [ 25, 12, 13, 26]. Knowledge Provenance (KP) has been proposed to create an approach to annotate the reliability of information extracted from web sources based on who created the assertion, how much the creator can be trusted, and what the information depends on. Little work has been done towards this end, and it mostly focuses on providing a taxonomy of four levels of provenance based on the certainty degree of each assertion [25]: Static KP (Level 1) for assertions for which the truth value does not change over time [25]; Dynamic KP (Level 2) allowing the validity of information to change over time [12]; Uncertainty-oriented KP (Level 3) considering truth values and relationships that are uncertain [13]; Judgment-based KP (Level 4) for provenance supported by social processes, e.g., truth propagation in social networks [26]. The Static Knowledge Provenance Ontology defines a taxonomy of proposition types and a set of axioms allowing the development of a reasoner to assess truth values based on diferent cases [ 11]. Although the ontology has been formalized 7http://www.w3.org/TR/2013/REC-prov-o-20130430/ 8https://www.w3.org/TR/prov-dm/ in [11], it is not available as a resource.

The fourth level, Wisdom Provenance, focuses on keeping track of the provenance of the wisdom inferred from knowledge or applications exploiting such knowledge, which is still unexplored in the literature.

3. Extended Nanopublication Model

We extend the original nanopublication model by introducing a novel component named knowledge provenance, which allows for the tracking of provenance for multi-sourced assertions. In this context, “knowledge provenance” is the nanopublication component describing which pieces of information contributed to support or confute the assertion. It can also include information about the reliability of the assertion and its assigned certainty degree. It is important to note that our approach does not change the other components of the original nanopublication model.

Since each module of a nanopublication is a named graph, we model knowledge provenance as a named graph according to an ontology called the PROV-K ontology. 9 We designed the 9The PROV-K ontology and its complete documentation are available at https://prov-k.dei.unipd.it/ontology/ PROV-K ontology by extending PROV-O 10 to represent provenance information of assertions derived from multiple sources using an aggregation algorithm. Note that the PROV-K ontology is a standalone resource, meaning it can also be used independently of nanopublications to represent provenance. We developed the PROV-K ontology following the guidelines provided by the Static KP ontology [11] with the addition of some elements from Dynamic KP, such as timestamps for truth values and trust relationships [12]. We also incorporate the concepts of assigned certainty degree and certainty degree from Uncertainty-oriented KP [13]. Representing truth values as probability distributions is more meaningful for our study than the static KP assumption, where the truth value is a categorical variable. We also expanded the concepts from Uncertainty-oriented KP to represent reliability testing. In this way, we can classify assertions into reliable or unreliable facts and include the conditions an assertion may fail to satisfy. We describe the PROV-K ontology based on four main areas: Propositions, Digital Signature and Information Sources, Trust Relationships, and Truth Value. Figure 1 reports the ontology schema.

Proposition. The central unit of the PROV-K ontology is the “Proposition”, which is defined as “the smallest piece of information to which provenance-related attributes may be ascribed” and as “a declarative sentence that is either true or false” [11]. In our context, the nanopublication model’s assertion graph can be considered the proposition. Since the PROV-K ontology extends PROV-O, we model propositions as subclasses of prov:Entity. We also define a taxonomy of propositions based on [11], which diferentiates between independent and dependent propositions, i.e., assertions whose truth value depends upon other propositions. Each proposition can be supported by or conflicting with other knowledge sources. To encompass this situation, we defined two object properties called supportedBy and conflictingWith, both with range class “Sentence” from the Semanticscience Integrated Ontology (SIO). 11 Each proposition can be linked to one or more knowledge fields with the object property dc:subject from the Dublin Core (DC) Metadata Items. 12 Information Source and Signature. To determine the provenance of a proposition, it may be useful to represent the document in which the considered proposition appears. We link the proposition to the “document” it belongs to with the object property prov:wasDerivedFrom from PROV-O. In this way, we allow a proposition to belong to any entity, e.g., a textual document or a dataset. For any document and any proposition, one can define its creator with the object property dc:creator from the DC Metadata Items, with range class prov:Agent from PROV-O. In this way, the creator of a proposition or a document may be any agent, e.g., a person or a digital artifact. We also represent the digital signature and signature status that can be assigned to a proposition. We apply the PROV-K ontology to model knowledge provenance in the context of nanopublications. However, the PROV-K ontology is a general resource designed to track provenance information for aggregated sources of evidence beyond the knowledge provenance graph of the nanopublication model. Thus, we defined the “Signature" class within 10https://www.w3.org/TR/prov-o/ 11http://semanticscience.org/resource/SIO_000113 12http://purl.org/dc/terms/subject

Assigned Truth Value PROV-K:AssignedTruthValue rdfs:SubClassOf

Reliable Fact

PROV-K:ReliableFact

Unreliable Fact PROV-K:UnreliableFact

PROV-K:unmetCondition PROV-K:unreliabilityReason xsd:string

Reliability Condition

PROV-K:ReliabilityCondition rdfs:SubClassOf

Insufficient Evidence PROV-K:InsufficientEvidence

Contrasting Evidence PROV-K:ContrastingEvidence

PROV-K:conditionThreshold PROV-K:conditionScore xsd:float xsd:float rdfs:SubClassOf

PROV-K:sufficiencyCriteria

Sufficiency Condition PROV-K:SufficiencyCondition

Sufficiency Criteria

PROV-K:SufficiencyCriteria

PROV-K:consistencyCriteria PROCVo-Kns:CisotennscisyteCnocnydCiotionndition PROCVo-Kns:CisotennscisyteCnrcityeCriraiteria

PROV-K Taxonomy PROV-K Taxonomy the ontology rather than solely relying on the modeling provided by the nanopublication model. Nevertheless, when we apply such an ontology to the nanopublication model, we can represent digital signatures with the “Nanopub Signature Element” 13 class from nanopubx. Trust Relationships. We identify two trust relationships, one between two agents, which are referred to as provenance requester and information creator (class “InfoCreatorTrust"), and another between an agent and a proposition (class “PropositionTrust"). The former is defined as “the provenance requester a “trusts” information creator c in a specific knowledge efild f”, where “trust” means “a believes any proposition created by c in field f to be true”. The latter takes the form of “Proposition x is trusted by an agent a” [11]. We also include the timestamp reporting when the trust relationship was issued using the data property prov:atTime from PROV-O. This work focuses on tracking the provenance of each piece of evidence and determining the reliability of a given fact. Nevertheless, the PROV-K ontology can be easily expanded to account for more complex trust relationships and decision processes.

Truth Values. Fox and Huang defined two types of truth values: the assigned truth value and the trusted truth value [11]. The former refers to the truth value assigned to the proposition by its creator. At the same time, the latter identifies the truth value evaluated by an external agent called “provenance requester”. In Static KP, the truth value of a proposition can be “True”, “False”, or “Unknown”. Thus, we link to each proposition the class TruthValue with object property hasTruthValue, where one can store both the assigned and trusted truth values defined in [11]. We also represent the timestamp reporting when the truth value was assigned or trusted with the data property prov:atTime from PROV-O. We define the assigned certainty degree as the probability that the proposition’s creator assigns the truth value of “True” to the proposition and the certainty degree as the probability that an agent evaluates the trusted truth value as 13http://purl.org/nanopub/x/NanopubSignatureElement “True” [13]. Based on the assigned certainty degree, we may classify propositions as reliable or unreliable. For this reason, we expand the concept of “assigned truth value” to account for reliability testing. We report the ontology schema for the assigned truth value in Figure 2. We classify the assigned truth value as “Reliable Fact” or “Unreliable Fact”, where the latter identifies propositions failing some pre-defined reliability tests. One can represent whether the proposition is unreliable due to insuficient evidence (subclass InsufficientEvidence) or due to contrasting evidence (subclass ContrastingEvidence). We also include a data property called unreliabilityReason to describe why the proposition is deemed as “unreliable”, as well as an object property called unmetCondition to specify the reliability conditions the proposition fails to respect. Each condition has a score, a threshold, and a criteria which describes how the reliability condition works. For instance, in CORE we have two suficiency and one consistency criteria, which are modeled as named individuals of type skos:Concept and either SufficiencyCriteria or ConsistencyCriteria.

4. Real-World Use Case

CoreKB. CORE is a Knowledge Base Construction (KBC) system based on the combination of ML-based models and domain-expert feedback [9]. CORE harvests text from the literature, identifies sentences containing pairs of relevant entities, and extracts fine-grained aspects from them to generate gene expression-cancer associations that can be published as facts – i.e., GCS. For each fact, CORE combines three aspect probabilities to assign the gene class likelihood to the three mutually exclusive gene classes: oncogene, tumor suppressor gene, and biomarker. Then, the system performs a two-stage reliability test that, for each fact, first verifies that the fact has suficient evidence and subsequently checks that mutually exclusive classes are not similarly probable, i.e. it assesses the degree of contradicting evidence. In this way, unreliable facts can be fed back to domain experts for manual annotations in an active learning paradigm, making CORE suitable to iterative KB versioning. For technical details and the evaluation of the CORE system, we resort the interested reader to [9].

The data extracted by CORE, and then ingested by CoreKB, contains information about 23,879 genes and 11,530 diseases for a total of more than 230K fine-grained facts supported by 1,037,845 sentences from 251,038 research articles. Figure 3 shows a GCS card displayed by the CoreKB platform Each GCS comprises information about the gene and disease involved, which are identified by National Center for Biotechnology Information ( NCBI) Gene IDs and Unified Medical Language System (UMLS) Concept Unique Identifiers ( CUIs) respectively, together with the assigned gene class. In addition, each GCS is linked to the sentences supporting the fact, i.e. identifying the same gene class, and those conflicting with it. For each sentence, provenance information includes the PubMed ID of the article from which the sentence has been extracted and the year of publication of such an article. CoreKB comprises three types of GCS: reliable facts, unreliable facts due to insuficient evidence, and unreliable facts due to low consensus (contrasting evidence). The former are facts that passed the reliability tests performed by CORE, while the others are facts that failed any of the two checks performed by the testing component.

Facts generated by CORE can be published as nanopublications since each GCS can be viewed as an assertion graph on its own. Giachelle et al. [10] showed the serialization of a

GCS in CoreKB following the standard nanopublication model. In summary, the assertion graph comprises the GCS itself in the form of an Resource Description Framework (RDF) graph, the publication info graph includes metadata about the nanopublication, and the provenance graph describes how the assertion was derived. However, by publishing facts within CoreKB as classic nanopublications we cannot represent supporting and conflicting sentences and embed information about their reliability. For instance, consider the GCS in Figure 3, the standard nanopublication model fails to represent that the assertion is derived from the ensemble of 83 supporting sentences and 41 conflicting sentences. Moreover, it cannot include information about the reliability of such an assertion, i.e., it cannot convey the information that the GCS is reliable as the probability that gene “AKT1” is an oncogene for “bladder neoplasm” is 0.82. Creation of the extended nanopublications. The extended nanopublication model has been applied to serialize all the facts in CoreKB as nanopublications with knowledge provenance. We followed the same methodology for the original nanopublication components used for the DisGeNET nanopublications [7]. Figure 4 shows a GCS modeled with the extended nanopublication model and serialized in TriG format. 14

A nanopublication representing the facts generated by CORE comprises five named graphs: head, assertion, provenance, publication information, and knowledge provenance. The head graph connects all the components by linking the nanopublication URI to its subgraphs. The assertion graph contains the GCS in RDF format. The provenance graph includes the information prove14Access at: https://gda.dei.unipd.it/cecore/resource/nanopub/3262e08b519b5e61b244c7e42c001d98/.

@prefix cegcs: <http://gda.dei.unipd.it/cecore/resource/GCS#> . @prefix cesent: <http://gda.dei.unipd.it/cecore/resource/Sentence#> . @prefix corekp: <http://gda.dei.unipd.it/cecore/resource/nanopub/PROV-K/> . … @prefix PROV-K: <ttps://w3id.org/PROV-K/ontology/schema/> . sub:head { this: a np:Nanopublication ; np:hasAssertion sub:assertion ; np:hasProvenance sub:provenance ; np:hasPublicationInfo sub:publicationInfo ; np:hasKnowledgeProv sub:knowledgeProv . } sub:assertion { … } sub:provenance { … } sub:publicationInfo { … } sub:knowledgeprov { sub:assertion PROV-K:hasTruthValue corekp:3262e08b519b5e61b244c7e42c001d98 ;

PROV-K:conflictingWith cesent:0aab5b916e37fed4e9a2a48104af848d, … cesent:f3dfdef16032046ce21e3bb7eebfabe9 ; PROV-K:supportedBy cesent:008a43c727cb5c31a071c3161e1b246f, … cesent:fed4a67bd6d534926a7f5c152e3aa827 . corekp:3262e08b519b5e61b244c7e42c001d98 a PROV-K:ReliableFact ;

PROV-K:assignedCertaintyDegree "0.81542736"^^xsd:float ;

PROV-K:assignedCertaintyDegreeSupport 83 . } nance and evidence used to build the GCS. In our case, all facts are derived from CoreKB and are generated automatically (class “Automatic Assertion"). 15 Since the facts generated by CORE often integrate more than one source of evidence (i.e., sentences), the source evidence is an instance of class “Combinatorial Evidence" from the Evidence and Conclusion Ontology (ECO). 16 The publication information graph includes the general topic of the nanopublications, information about the authors of the nanopublications, and the used dataset. Since CoreKB comprises ifne-grained gene expression-cancer associations, the general topic for all nanopublications is “gene-disease association linked with altered gene expression” from SIO. 17 The knowledge provenance graph includes all supporting and conflicting sentences and information about the reliability of the GCS. For instance, the GCS in Figure 4 is a reliable fact. Hence, its truth value 15http://purl.obolibrary.org/obo/ECO_0000203 16http://purl.obolibrary.org/obo/ECO_0000212 17http://semanticscience.org/resource/SIO_001123 is an instance of class “ReliableFact" and we report its assigned certainty degree and support.

To represent the assertion graph, we rely on the ontology underlying KBs generated by CORE, 18 while for the provenance graph we rely on PROV-O. 19 For the authorship and versioning, we employ the Provenance, Authoring, and Versioning (PAV) vocabulary [27], and for the description of the used datasets we employ the Provenance Vocabulary Core ontology Specification ( PRV) [28]. The evidence annotation is described using the Weighted Evidence (WI) vocabulary, 20 which comprises the object property wi:evidence to link the assertion to its evidence, and the ECO ontology. 21 For the description of the topic of the nanopublications and the process used to build the assertion, we use the SIO ontology. 22 For the knowledge provenance graph, we extended the PROV-K ontology to include the reliability condition defined by the CORE system. Specifically, we added two suficiency criteria and one consistency criterion. A fact generated by CORE passes the suficiency checks if the probability of Change of Cancer Status (CCS) and Gene-Cancer Interaction (GCI) being not informative is below a threshold value set to 0.7. The consistency test instead checks whether the diference between the probabilities of the fact being classified with the two gene classes with the highest likelihood is bigger than a threshold value set to 0.4.

To build the extended nanopublications, we extended the Python package nanopub. 23 We kept the provenance, publication information, and assertion graph unchanged to provide backward compatibility with the original nanopublication model. In addition, we developed a Python package to publish the facts in CoreKB as extended nanopublications serialized in TriG syntax. The code can take as input two CSV files comprising the facts and the sentences supporting or conflicting with it, or one can provide a Turtle (.ttl) file comprising the CoreKB dump available in Zenodo [29]. The code for serializing the facts in CoreKB as extended nanopublications can also serve as a template for future applications on diferent resources.

CoreKB comprises 231,099 GCS, which can be divided into reliable facts, unreliable due to insuficient evidence, and unreliable due to low consensus. We filter out unreliable facts due to insuficient evidence, as publishing them as independent publications provides little to no information. As a result, we published 197,511 facts from CoreKB as extended nanopublications, accounting for 156,172 reliable facts and 41,339 unreliable ones due to contrasting 18http://gda.dei.unipd.it/cecore/ontology/ 19http://www.w3.org/TR/prov-o/ 20http://www.evidenceontology.org/ 21https://ontobee.org/ontology/ECO 22https://ontobee.org/ontology/SIO 23https://github.com/fair-workflows/nanopub evidence. Table 1 shows the gene class distribution of the facts in CoreKB serialized as extended nanopublications. The extended nanopublications are also available in Zenodo [14].

We include serialized nanopublications into the CoreKB platform to ease facts visualization. For each GCS, one can explore the serialized nanopublication by clicking on the eye icon placed in the drop-down list on the right side of the claim (see point B in Figure 3). 24 The visualization depicts each component with a diferent color and displays URIs redirected to a functioning website containing the description of the considered element. One can also download the extended nanopublication representing a specific GCS thanks to the download button (see point A in Figure 3).

5. Final Remarks

This work extends the current nanopublication model to include a novel component called knowledge provenance, accounting for the provenance information of assertions derived from the aggregation of multiple sources of evidence. We described knowledge provenance as a named graph tracking the provenance of each piece of information that contributes to support or confute the assertion. To support the semantics of the knowledge provenance graph, we designed the PROV-K ontology, an integration of PROV-O representing provenance information of assertions derived from the aggregation of multiple sources. The PROV-K ontology is a general resource designed to track provenance information for aggregated sources of evidence beyond the context of nanopublications. We applied the proposed model by serializing more than 197K facts in CoreKB and publishing them as extended nanopublications. Such nanopublications can be easily browsed and downloaded through the CoreKB platform. The serialization of facts in CoreKB can also serve as a template for applying the extended nanopublication model on diferent resources.

Acknowledgments

This project has received funding from the HEREDITARY Project, as part of the European Union’s Horizon Europe research and innovation programme under grant agreement No GA 101137074. August 2005, Copenhagen, Denmark, IEEE Computer Society, 2005, pp. 524–528. URL: https://doi.org/10.1109/DEXA.2005.193. [27] P. Ciccarese, S. Soiland-Reyes, K. Belhajjame, A. J. G. Gray, C. A. Goble, T. Clark, PAV ontology: provenance, authoring and versioning, J. Biomed. Semant. 4 (2013) 37. URL: https://doi.org/10.1186/2041-1480-4-37. [28] O. Hartig, J. Zhao, Publishing and consuming provenance metadata on the web of linked data, in: Proc. of the Provenance and Annotation of Data and Processes - Third International Provenance and Annotation Workshop, IPAW 2010, Troy, NY, USA, June 15-16, 2010. Revised Selected Papers, volume 6378 of Lecture Notes in Computer Science, Springer, 2010, pp. 78–90. URL: https://doi.org/10.1007/978-3-642-17819-1_10. [29] S. Marchesin, L. Menotti, G. Silvello, O. Alonso, CORE: Gene Expression-Cancer Knowledge Base, Zenodo, 2023. URL: https://doi.org/10.5281/zenodo.7577127.

[1]

Weikum ,

X. L.

Dong ,

Razniewski ,

F. M.

Suchanek , Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases, Found . Trends Databases 10 ( 2021 ) 108 - 490 . URL: https://doi.org/10.1561/1900000064.

[2]

X. L.

Dong , Generations of knowledge graphs: The crazy ideas and the business impact , Proc. VLDB Endow . 16 ( 2023 ) 4130 - 4137 . URL: https://doi.org/10.14778/3611540.3611636.

24The serialized nanopublication representing the GCS used as an example throughout the paper can be visualized