A model for capturing provenance of assertions about chemical substances

Kody Moodley1, Amrapali Zaveri1, Chunlei Wu2, and Michel Dumontier1[0000-0003-4727-9435]

1 Institute of Data Science, Maastricht University, Universiteitsingel 60, 6229 ER, Maastricht, The Netherlands
firstname.lastname@maastrichtuniversity.nl
2 Department of Molecular and Experimental Medicine, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA
cwu@scripps.edu

Abstract. Chemical substance resources on the Web are often made accessible to researchers through public APIs (Application Programming Interfaces). A significant problem of missing provenance information arises when extracting and integrating data in such APIs. Even when provenance is stated, it is usually not done with any prescribed templates or terminology. This creates a burden on data producers and makes it challenging for API developers to automatically extract and analyse this information. Downstream, these consequences hinder efforts to automatically determine the veracity and quality of extracted data, which is critical for proving the integrity of associated research findings. In this paper, we propose a model for capturing provenance of assertions about chemical substances by systematically analyzing three sources: (i) Nanopublications, (ii) Wikidata and (iii) selected Minimal Information Standards (MISTS) for reporting biomedical studies (reported in FAIRsharing.org, https://fairsharing.org). We analyse provenance terms used in these sources along with their frequency of use, and synthesize our findings into a preliminary model for capturing provenance.

Keywords: API · provenance · evidence · data model · chemical substance

1 Introduction

An increasing number of chemical substance databases on the Web are made accessible through public APIs (Application Programming Interfaces), which researchers can query to enrich their computational analyses and scientific workflows. One such API is the BioThings API suite (http://biothings.io/). "BioThings" refers to objects of any biomedical entity type represented in the biological knowledge space, such as genes, drugs, chemicals, diseases, etc. The popular MyChem.info (chemicals), MyGene.info (genes) and MyVariant.info (gene variants) APIs are demonstrable examples built and maintained using the BioThings SDK.

The problem of provenance for scientific assertions arises for the stakeholders of such APIs. We adopt the W3C's definition of provenance: "Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness" (https://www.w3.org/TR/prov-overview). An assertion refers to an individual statement about a particular entity, in this case a chemical entity. For example, PubChem states that "Acetaminophen has a melting point of 168 degrees Celsius" (https://pubchem.ncbi.nlm.nih.gov/compound/1983#section=Melting-Point). In practice, this assertion might be encoded in a concrete data format, e.g. in JSON as a key-value pair, or in RDF (Resource Description Framework, https://www.w3.org/RDF/) as a subject-predicate-object triple. In PubChem, two references are listed for this assertion: the publication reporting it, and a database from which it was retrieved. Both items indicate provenance, though provenance is not limited to these types. Provenance also includes evidence supporting the assertion (in the form of specific data elements, media, graphs etc.) and the experimental methodologies that generate it.
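To make this concrete, below is a minimal, hypothetical JSON sketch of such an assertion together with the two kinds of references PubChem lists for it. The field names, the DOI and the URL are illustrative placeholders of our own choosing and do not correspond to the actual PubChem or MyChem.info record formats.

  {
    "assertion": {
      "subject": "Acetaminophen",
      "predicate": "melting_point",
      "object": { "value": 168, "unit": "degrees Celsius" }
    },
    "references": [
      {
        "type": "publication",
        "citation": "Journal article reporting the measured melting point (illustrative)",
        "doi": "10.0000/example-doi"
      },
      {
        "type": "database",
        "name": "Source database from which the value was retrieved (illustrative)",
        "url": "https://example.org/source-database"
      }
    ]
  }

Because the reference entries are essentially free-form, nothing in such a record tells a consuming API which provenance items are essential or how they map to standard vocabulary terms; this is precisely the gap addressed in this paper.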
The data producers behind the various sources that an API integrates often do not submit sufficient provenance information for assertions, or, when they do, they use inconsistent terminology. This inconsistency is not helped by the lack of a standardized specification for capturing provenance. The problem also makes it challenging for API developers to represent provenance in the results of submitted queries in a machine-interpretable way. While there have been attempts to develop vocabularies for capturing provenance, there is no accepted guideline for identifying which provenance items are essential (must be specified), recommended (should be specified) and optional. Another missing feature is a universally accepted standard for which vocabularies to use when specifying provenance about chemical assertions. For example, Wikidata [14] uses "stated in" (encoded as P248) to refer to a database from which an assertion was extracted. However, Nanopublications (Nanopubs) [10] have a variety of possible terms to describe the same item, including "hasSource", "references" and "cites". Downstream, this complicates automatic identification, extraction and processing of provenance by developers, which is crucial for validating research findings associated with given assertions.

To address these problems, we systematically analyse three sources: (i) Nanopublications, (ii) Wikidata and (iii) selected MISTS for reporting biomedical studies reported in FAIRsharing.org, to examine how they capture provenance. We then synthesise our findings to propose a preliminary model for capturing provenance of assertions about chemical substances. Our motivation for studying Nanopubs and Wikidata in particular is that they are the only large-scale databases of scientific assertions providing mechanisms for specifying machine-processable provenance information that is mapped to standard ontology terms. MISTS are studied because they are the only known recommendations of minimal information (including provenance of study assertions) required for describing biomedical studies.

2 Related Work

There have been many efforts at standardizing general data provenance on the Web [6,13,5]. Provenance of datasets in the life sciences has also received attention with the BioSchemas [4], DATA Tag Suite [11] and HCLS [3] initiatives. However, there are currently no specialized models for chemical substance assertions.

In terms of scientific assertions, the most relevant initiative providing a partial specification is Wikidata (http://Wikidata.org). Wikidata is an online open knowledge repository storing structured data on a wide variety of topics, including chemicals. In Wikidata's provenance model, an assertion is called a "claim". An example of a claim is "Acetaminophen is a subclass of non-opioid analgesic". The entities "Acetaminophen" and "non-opioid analgesic", and the "subclass" relation, are referred to by the Wikidata-specific identifiers Q57055, Q1747785 and P279, respectively. Each claim has a list of "references", each of which is a separate record of provenance for the claim. Each reference can specify an arbitrary number of provenance items supporting the claim (e.g. database, publication or date retrieved).
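To illustrate, the following is a simplified and abridged sketch of how such a claim and one of its references appear in Wikidata's entity JSON (several bookkeeping fields are omitted). The reference uses "stated in" (P248) and "retrieved" (P813); the item identifier shown for the source database (Q99999999) is a placeholder, not a real Wikidata item.

  {
    "claims": {
      "P279": [
        {
          "mainsnak": {
            "snaktype": "value",
            "property": "P279",
            "datavalue": {
              "type": "wikibase-entityid",
              "value": { "entity-type": "item", "id": "Q1747785" }
            }
          },
          "references": [
            {
              "snaks": {
                "P248": [
                  {
                    "snaktype": "value",
                    "property": "P248",
                    "datavalue": {
                      "type": "wikibase-entityid",
                      "value": { "entity-type": "item", "id": "Q99999999" }
                    }
                  }
                ],
                "P813": [
                  {
                    "snaktype": "value",
                    "property": "P813",
                    "datavalue": {
                      "type": "time",
                      "value": { "time": "+2018-04-05T00:00:00Z", "precision": 11 }
                    }
                  }
                ]
              }
            }
          ]
        }
      ]
    }
  }

Note that the reference is expressed entirely with Wikidata-specific property identifiers rather than terms from standard provenance ontologies, a limitation we return to below.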
Wikidata's provenance model is not sufficient for our goals since it does not indicate which provenance items are essential, recommended and optional. It also uses Wikidata-specific terms for provenance that are not mapped to standard (bio-)ontology terms, which would increase their interpretability.

There have also been various efforts to standardize terminology that data producers can use to capture provenance [7,2,9,12,1]. We briefly describe three prominent terminologies below.

Open PHACTS & the W3C Provenance working group. Open PHACTS [15] is an online drug discovery platform integrating, linking, and providing access to data across numerous biomedical resources. To establish a standardized and interoperable way of exchanging provenance information, Open PHACTS participated in the activities of the W3C Provenance working group to influence the development of such a standard. A major output of the W3C Provenance working group is the Provenance Ontology (PROV-O) [8], a domain-independent RDF terminology for describing provenance information. PROV-O consists of terms to denote either physical or conceptual entities of interest, as well as property terms to denote provenance-related relationships between such entities [8, Figure 1]. For example, the assertion "Rifampin is effective in the treatment of Pulmonary Tuberculosis" is represented in RDF as a triple whose subject and object are resources identifying Rifampin and Pulmonary Tuberculosis, connected by a predicate expressing "is effective in the treatment of". To associate provenance with this statement, such as the publication in which it was proposed, one can use prov:hadPrimarySource (prov is the prefix for http://www.w3.org/ns/prov#) or prov:wasQuotedFrom to specify the URL or DOI (Digital Object Identifier) of the article in which the assertion originates (see the sketch at the end of this section).

Provenance, Authoring & Versioning (PAV) ontology. The PAV ontology (https://pav-ontology.github.io/pav) addresses some limitations of PROV-O with regard to authoring and versioning of digital resources on the Web. In particular, PROV-O is not expressive enough to capture specialised authoring roles such as "contributor" and "curator". This finer-grained approach also extends to properties associated with these roles. The ontology was designed to be lightweight (containing as few terms as possible to satisfy its requirements), which makes it useful only as a complementary terminology, in combination with others such as PROV-O, to fit use cases in specialized knowledge domains.

Chemical Information (CHEMINF) Ontology. The Chemical Information Ontology (https://www.ebi.ac.uk/ols/ontologies/cheminf) was established to structure information and standardize terminology in modern chemical research. This kind of research regularly requires various simulations and calculations to be performed on chemical data. CHEMINF provides a standardized vocabulary for annotating these calculations and the software that produces them, which is crucial for diagnosing errors and determining their veracity. The terms in CHEMINF are broadly classified into those related to software, the algorithms implemented by them, and the properties of chemical substances.
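To show how the terminologies above fit together, the following JSON-LD fragment is a minimal sketch, under our own modelling assumptions, of attaching provenance to the Rifampin assertion: the assertion is reified as a resource (identified by a hypothetical IRI in the ex: namespace) so that PROV-O and PAV properties can be attached to it; the predicate name, author and DOI are likewise illustrative placeholders rather than terms prescribed by these ontologies.

  {
    "@context": {
      "prov": "http://www.w3.org/ns/prov#",
      "pav": "http://purl.org/pav/",
      "ex": "http://example.org/"
    },
    "@id": "ex:assertion-rifampin-tb",
    "ex:subject": { "@id": "ex:Rifampin" },
    "ex:predicate": { "@id": "ex:isEffectiveInTreatmentOf" },
    "ex:object": { "@id": "ex:PulmonaryTuberculosis" },
    "prov:hadPrimarySource": { "@id": "https://doi.org/10.0000/example-doi" },
    "pav:authoredBy": { "@id": "ex:some-researcher" },
    "pav:createdOn": {
      "@value": "2018-04-05",
      "@type": "http://www.w3.org/2001/XMLSchema#date"
    }
  }

Nanopublications, discussed in the next section, address the same need by packaging an assertion and its provenance metadata together as an RDF design pattern.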
3 Provenance usage in Nanopubs, Wikidata and MISTS

3.1 Nanopublications

Building on the RDF standard for representing assertions on the Web, Barend Mons and Jan Velterop proposed the concept of Nanopublication in 2009 [10]. The motivation was to enable researchers to publish small structured snippets of knowledge (Nanopublications, or nanopubs for short) from research data, complementing the publishing of traditional full-length research texts. Indeed, Mons and Velterop argue that "Published contributions to science can be as short as single statements that interpret data, and yet be valuable to scientific progress and understanding" [10, Section 2.4]. A nanopub consists of a core assertion represented as an RDF triple, richly annotated with qualifying metadata and provenance for the assertion. Since all language features required to represent nanopubs are included in RDF, the nanopub specification is essentially an RDF design pattern for expressing assertions and their associated metadata.

We queried all Nanopublications (the latest dump, from 5 April 2018, which includes 10,803,231 nanopubs; https://zenodo.org/record/1213293#.W6-WwxMzaAw) to retrieve all properties associated with them, along with their usage frequencies. A total of 333 properties were extracted. We then manually pruned this list of properties to retain only those that were related to provenance of an assertion, which resulted in 37 unique properties. From this pruned list, we found that the most frequently used properties are "hasPublicationInfo", "hasEvidence", "date of assertion" and "author". Figure 1 plots the usage frequency (logarithmic scale) of each provenance item.

[Fig. 1. Usage frequency of each provenance item in Nanopubs.]

We further classified the properties into five major dimensions:

– Authorship refers to information about the person(s) or organisational entities responsible for generating the assertion.
– Temporal aspects refer to information concerning when the assertion was made, modified, etc.
– Source refers to information about the publication or database from which the assertion was extracted.
– Generating process concerns information about the experiments, studies and analytical procedures which led to the proposal of the assertion.
– Evidence refers to individual datasets, pieces of data and other assertions which support or prove the given assertion.

Figure 3 plots the usage frequency (logarithmic scale) of Nanopublication provenance properties in these different dimensions.

3.2 Wikidata

We performed a similar analysis of claims in Wikidata. We extracted ca. 150,000 records of type "Chemical Compound" from Wikidata. For all claims about these compounds that had references, we counted the usage frequency of each type of provenance property (see Figure 2). As a result of our analysis, we retrieved 76 unique metadata items (Figure 2). We further analysed this list to eliminate domain-specific provenance metadata items and pruned the list to 37 metadata items. Some examples of non-domain-specific items we found are "stated in", "publication date", "retrieved", "software version", "DOI", etc.

[Fig. 2. Usage frequency of each provenance item in Wikidata.]
[Fig. 3. Wikidata and Nanopubs provenance usage frequency per dimension.]

By examining the frequency across the provenance dimensions (Figure 3), we noticed that Nanopub authors submit much more evidence, authorship and temporal information than Wikidata contributors. Another finding is that, for both Nanopubs and Wikidata, Generating process information is largely not specified.

3.3 MISTS

We additionally analyzed selected MISTS for reporting biomedical studies registered at FAIRsharing.org that are tagged 'Recommended' and have an associated publication ('Has Publication').
As a result of this query, we retrieved a list of 16 reporting guidelines, of which 2 were duplicates and 1 was unavailable. Thus, we analysed the metadata elements of the following 14 reporting guidelines:

– Animals in Research: Reporting In Vivo Experiments
– STAndards for the Reporting of Diagnostic accuracy
– STrengthening the Reporting of OBservational studies in Epidemiology
– Preferred Reporting Items for Systematic reviews and Meta-Analyses
– CONSOlidated standards of Reporting Trials
– Minimum Information About a Microarray Experiment
– Minimal Information Required In the Annotation of Models
– Minimum Information about a Molecular Interaction Experiment
– Minimum Information About a Proteomics Experiment
– Minimum Information about any (x) Sequence
– Recommended reporting guidelines for life science resources
– Consolidated criteria for reporting qualitative research
– Case Reports
– Consolidated Health Economic Evaluation Reporting Standards

A total of 347 metadata elements were extracted from all of these reporting guidelines. We analyzed each of them and pruned the list to 44 elements, such as "ethics statement", "apparatus", "duration" and "location".

4 Proposed provenance model

The analysis in Section 3 demonstrates that all extracted provenance properties can be classified into the 5 dimensions of provenance information: Authorship, Temporal aspects, Evidence, Generating process and Source. While there may be other interesting dimensions of provenance information that data producers do not yet record, our primary goal is to capture existing information in a structured way. Therefore, we use the 5 dimensions as a guide for selecting the relevant provenance items for our model. For each dimension we also identified certain subcategories of provenance items relevant to that dimension. The complete hierarchy is depicted in Figure 4.

[Fig. 4. Provenance dimensions for a scientific assertion about chemicals.]

The procedure for selecting the provenance items (and corresponding ontology terms) to use for each provenance dimension was based on two criteria: 1) the frequency of their use by data publishers in Wikidata and Nanopubs, and 2) their relative importance in determining the veracity of an assertion (as judged by two postdoctoral researchers with 12 years of combined experience in bioinformatics research). For authorship of an assertion, we identified three potentially important pieces of information: the name of the primary author of the assertion, the names of any co-authors of the assertion, and the names of any persons who contributed to the discovery or generation of the assertion. The Source dimension can be broadly separated into properties pertaining to the scientific publication in which the assertion was reported, and those concerning a database (potentially an indirect source) from which the assertion was retrieved. For Generating process, we divided the properties into those related to clinical trial studies and those related to computational analyses. Evidence can be derived from another (secondary) assertion, a specific data element (such as images, audio/visual media, graphs, calculations etc.), or a specialized dataset. Finally, the main temporal properties we associate with an assertion are the dates on which it was conceived, published and last modified.
Thus, our proposed provenance model consists of (i) a listing of provenance properties to be used for assertions about chemical substances, (ii) a precise definition of each property (through mappings to standard ontology terms), and (iii) a recommendation of which properties are essential, recommended and optional, respectively. These are fully detailed in the file "BioThingsProvenanceModel.xslx" within our GitHub repository (https://github.com/MaastrichtU-IDS/biothingsprovenancemodel). Table 1 summarises the essential properties in each dimension. Each property is assigned an ontology term using prefixes (resolvable at http://prefix.cc). The model is independent of any concrete data format, and is implementable in the JSON-LD and RDF standards, for example. We provide example instantiations of our model for these two formats in our GitHub repository, in the files "jsonldexample.json" and "rdfexample.ttl", respectively; an illustrative sketch is shown after Table 1.

Table 1. Summary of the prescribed essential provenance properties in our model.

#   Name                               (Sub)Dimension        Ontology term
1   assertedBy                         Author                dcterms:creator
2   coAssertedBy                       Co-author             obo:MS_1002036
3   assertedOn                         Date asserted         prov:generatedAtTime
4   publishedOn                        Date published        dbpedia:publicationDate
5   supportedByDataSet                 Dataset               prov:wasDerivedFrom
6   supportingDatasetVersion           Dataset               schema:version
7   supportingDatasetLicense           Dataset               schema:license
8   supportingDatasetURL               Dataset               schema:url
9   wasDerivedFrom                     Data element          sio:SIO_000772
10  wasInferredFrom                    Secondary assertion   prov:wasDerivedFrom
11  supportingExperimentID             Experiment / Study    schema:identifier
12  supportingAnalysisSoftware         Analysis              swo:SWO_0000001
13  supportingAnalysisMethod           Analysis              prov:wasGeneratedBy
14  supportingAnalysisSoftwareVersion  Analysis              sio:SIO_000654
15  supportingSoftwareLicense          Analysis              schema:license
16  publishedIn                        Publication           prov:hadPrimarySource
17  publisher                          Publication           schema:publisher
18  publicationTitle                   Publication           dbpedia:publicationTitle
19  retrievedFrom                      Database              pav:retrievedFrom
20  retrievalURL                       Database              pav:retrievedFrom
21  databaseURL                        Database              schema:url
22  databaseLicense                    Database              schema:license
23  databaseVersion                    Database              schema:version
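For orientation, below is a minimal illustrative JSON-LD sketch (not reproduced from the repository files, which may differ) of how an assertion could be annotated with a few of the essential properties in Table 1. The @context maps the model's property names to their ontology terms; the assertion IRI, author, date and source URLs are hypothetical.

  {
    "@context": {
      "dcterms": "http://purl.org/dc/terms/",
      "prov": "http://www.w3.org/ns/prov#",
      "pav": "http://purl.org/pav/",
      "assertedBy": { "@id": "dcterms:creator" },
      "assertedOn": { "@id": "prov:generatedAtTime" },
      "publishedIn": { "@id": "prov:hadPrimarySource", "@type": "@id" },
      "retrievedFrom": { "@id": "pav:retrievedFrom", "@type": "@id" },
      "supportedByDataSet": { "@id": "prov:wasDerivedFrom", "@type": "@id" }
    },
    "@id": "http://example.org/assertions/acetaminophen-melting-point",
    "assertedBy": "Jane Researcher (hypothetical)",
    "assertedOn": "2018-04-05",
    "publishedIn": "https://doi.org/10.0000/example-doi",
    "retrievedFrom": "https://example.org/source-database",
    "supportedByDataSet": "https://example.org/datasets/melting-points"
  }

An equivalent RDF (Turtle) rendering would simply use the mapped ontology terms directly as predicates on the assertion resource.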
5 Conclusions, limitations and future work

In this paper, we proposed a model for capturing provenance about chemical substances when retrieving information via an API across disparate data sources. We reported on a systematic analysis of three sources concerning how they report provenance, and specifically which non-domain-specific provenance items are advocated based on their frequency of usage. The three sources analyzed were (i) Nanopublications, (ii) Wikidata, and (iii) selected MISTS for reporting biomedical studies (from FAIRsharing.org). Our provenance model consists of 90 unique properties. As future work, we will implement the model on 10 prominent data sources amalgamated by the BioThings API (specifically MyChem.info), and evaluate its utility. As next steps, we envision the relevance of the selected elements being discussed widely within the biomedical community, and a future version being recommended for widespread uptake.

6 Acknowledgements

Support for this work was provided by NCATS, through the Biomedical Data Translator program (NIH awards OT3TR002027 [Red]). Any opinions expressed in this document are those of the Translator community writ large and do not necessarily reflect the views of NCATS, individual Translator team members, or affiliated organizations and institutions.

References

1. Brush, M.H., Shefchek, K., Haendel, M.: SEPIO: A semantic model for the integration and analysis of scientific evidence. In: Proceedings of the Joint International Conference on Biological Ontology and BioCreative (2016)
2. Ciccarese, P., Soiland-Reyes, S., Belhajjame, K., Gray, A.J., Goble, C., Clark, T.: PAV ontology: provenance, authoring and versioning. Journal of Biomedical Semantics 4(1), 37 (2013)
3. Dumontier, M., Gray, A.J., Marshall, M.S., Alexiev, V., Ansell, P., Bader, G., Baran, J., Bolleman, J.T., Callahan, A., Cruz-Toledo, J., et al.: The health care and life sciences community profile for dataset descriptions. PeerJ 4, e2331 (2016)
4. Garcia, L., Giraldo, O., Garcia, A., Dumontier, M.: Bioschemas: schema.org for the life sciences. In: Proceedings of SWAT4LS (2017)
5. Glavic, B., Dittrich, K.R.: Data provenance: A categorization of existing approaches. In: BTW. vol. 7, pp. 227–241 (2007)
6. Hartig, O.: Provenance information in the web of data. In: Proceedings of the Linked Data on the Web Workshop (2009)
7. Hastings, J., Chepelev, L., Willighagen, E., Adams, N., Steinbeck, C., Dumontier, M.: The chemical information ontology: provenance and disambiguation for chemical data on the biological semantic web. PLoS ONE 6(10), e25513 (2011)
8. Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S., Zhao, J.: PROV-O: The PROV ontology. W3C Recommendation (2013), http://www.w3.org/TR/prov-o, (last accessed 28 August 2018)
9. Missier, P., Belhajjame, K., Cheney, J.: The W3C PROV family of specifications for modelling provenance metadata. In: Proceedings of EDBT. pp. 773–776. ACM (2013)
10. Mons, B., Velterop, J.: Nano-publication in the e-science era. In: Proceedings of the Workshop on Semantic Web Applications in Scientific Discourse (2009)
11. Sansone, S.A., Gonzalez-Beltran, A., Rocca-Serra, P., Alter, G., Grethe, J.S., Xu, H., Fore, I.M., Lyle, J., Gururaj, A.E., Chen, X., et al.: DATS, the data tag suite to enable discoverability of datasets. Scientific Data 4, 170059 (2017)
12. Sarntivijai, S., Vasant, D., Jupp, S., Saunders, G., Bento, A.P., Gonzalez, D., Betts, J., Hasan, S., Koscielny, G., Dunham, I., Parkinson, H., Malone, J.: Linking rare and common disease: mapping clinical disease-phenotypes to ontologies in therapeutic target validation. Journal of Biomedical Semantics 7(1), 8 (2016)
13. Simmhan, Y.L., Plale, B., Gannon, D.: A survey of data provenance in e-science. SIGMOD Record 34(3), 31–36 (2005)
14. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)
15. Williams, A.J., Harland, L., Groth, P., Pettifer, S., Chichester, C., Willighagen, E.L., Evelo, C.T., Blomberg, N., Ecker, G., Goble, C., Mons, B.: Open PHACTS: semantic interoperability for drug discovery. Drug Discovery Today 17(21), 1188–1198 (2012)