<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Developing Scientific Knowledge Graphs Using Whyis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>James P. McCusker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabbir M. Rashid</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nkechinyere Agu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kristin P. Bennett</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deborah L. McGuinness</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rensselaer Polytechnic Institute</institution>
          ,
          <addr-line>Troy, NY 12180</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We present Whyis, the first framework for creating custom provenance-driven knowledge graphs. Whyis knowledge graphs are based on nanopublications, an approach that simplifies and standardizes the production of structured, provenance-supported knowledge in knowledge graphs. To demonstrate Whyis, we created BioKG, a probabilistic biology knowledge graph, and populated it with widely used drug and protein content from DrugBank, Uniprot, and OBO Foundry ontologies. As shown with BioKG, knowledge graph developers can use Whyis to configure custom knowledge curation pipelines using data importers and semantic extract, transform, and load scripts. Whyis also contains a knowledge meta-analysis capability for use in customizable graph exploration. The flexible, nanopublication-based architecture of Whyis lets knowledge graph developers integrate, extend, and publish knowledge from heterogeneous sources on the web.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Successful scientific knowledge graph construction requires more than simply
storing and serving graph-oriented data. Keeping a knowledge graph up to date
can require developing a knowledge curation pipeline that either replaces the
graph wholesale, whenever updates are made, or requires detailed tracking of
knowledge provenance across multiple data sources. Additionally, non-deductive
forms of knowledge inference have become important to knowledge graph
construction. User interfaces are also key to the success of a knowledge
graph-enabled application. Google’s knowledge graph, for instance, takes the semantic
type of the entity into account when rendering information about that entity.
Addressing these challenges depends on high-quality knowledge provenance that, we
claim, should be inherent in the design of any knowledge graph system, and not
merely an afterthought. We demonstrate Whyis by creating and deploying an example
knowledge graph that supports the use cases of our previously custom-designed
knowledge graph for drug repurposing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. BioKG, or Biological Knowledge Graph, uses
data from DrugBank and Uniprot.1 We also discuss the benefits of using our
approach over using a conventional knowledge graph pipeline.
1 http://drugbank.ca and http://uniprot.org, respectively.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        To address these challenges, Whyis provides a semantic analysis ecosystem
(Figure 1). This is an environment that supports research and development of
semantic analytics for which we have previously had to build custom applications
[
        <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
        ]. Users interact through views into the knowledge graph, driven by the type
of node and view requested in the URL. Knowledge is curated into the graph
through knowledge curation methods, including Semantic Extract, Transform,
and Load (SETL), external linked data mapping, and Natural Language
Processing (NLP). Autonomous inference agents in Whyis expand the available
knowledge using traditional deductive reasoning as well as inductive methods that can
include predictive models, statistical reasoners, and machine learning.
      </p>
      <p>
        As a nanopublication-based knowledge graph, Whyis uses nanopublications
to encapsulate every piece of knowledge introduced into the knowledge graphs it
manages. Even in cases where the user does not explicitly provide provenance,
Whyis collects provenance of who created the assertion and when. Introduced
in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and expanded on in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a nanopublication is composed of three named
RDF graphs: an Assertion graph, a Provenance graph, and a Publication Info
graph. We see knowledge graphs that include the level of granularity supported
by nanopublications as essential to fine-grained management of knowledge in
knowledge graphs that are curated and inferred from diverse sources and can
change on an ongoing basis. Other systems, such as DBpedia2 and Uniprot, have
very coarse-grained management of knowledge using very large named graphs,
which limits the ability to version, explain, infer from, and annotate the
knowledge. With nanopublications, provenance of both the assertion and the artifact
it is contained in is included in each edit of the knowledge graph. It is
therefore possible to track user-contributed provenance as well as automatically
collect provenance. For instance, Whyis tracks who composed the nanopublication
and when it was edited using PROV-O and Dublin Core Terms. In the case of
mapping in linked data, Whyis tracks where the graph was imported from. An
example nanopublication from BioKG is shown in Figure 2. Whyis is written in
Python using the Flask framework, and builds on a number of existing infrastructure
tools.
      </p>
      <p>Knowledge inference in Whyis is performed by a suite of “agents”, each
performing the analogue of a single rule in traditional deductive inferencing. An
agent is composed of a SPARQL query that serves as the “rule body” and a
Python function that serves as the “rule head”. An agent is invoked when new
nanopublications that match its SPARQL query are added to the knowledge graph.
The head function produces RDF nanopublications as output, which are in turn
processed by other agents. The agent superclass assigns basic provenance and
publication information related to the given inference activity, which developers
can add to in their implementations. Inference agents included with Whyis
provide entity extraction and resolution against existing knowledge graph nodes,
deductive reasoning that can be configured with custom rules, and many
pre-configured OWL 2 rules.
2 http://dbpedia.org</p>
      <sec id="sec-2-1">
        <title>-</title>
        <p>[Fig. 1. The Whyis semantic analysis ecosystem: knowledge curation (NLP and
machine reading, semantic ETL, SDDs, and mapping of literature, databases, public
datasets, and linked open data), knowledge interaction (semantic annotators,
semantic browsers, cognitive agents, and visualization and analysis, through which
users contribute knowledge and receive answers and explanations), and knowledge
inference (deductive reasoners, statistical reasoners, predictive modelers, and
machine learning), all operating over shared knowledge, ontologies, data,
hypotheses, and results.]</p>
        <p>
          To create probabilistic knowledge graphs, Whyis implements a method of
automated meta-analysis using the provenance of graph assertions when
computing which nodes are related to each other. We refined the methods used in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
by using Stouffer’s Z-Method [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] to produce an overall probability for the link
claim. We allow developers to determine, on a per-node type basis, what
constitutes a link to or from that node. For biological entities in BioKG, outgoing
links are defined as instances of sio:Interacting, where the current node is the
sio:hasAgent, and the connected node, the sio:hasTarget. The default
implementation uses sio:ProbabilityMeasure attributes on the nanopublication assertions.
These attributes can be asserted directly, or developers can write an inference
agent that computes values for them. If those attributes are not available, the
default implementation looks at the number of references cited in the
assertion’s provenance, and assigns the configured base rate (in BioKG, focusing on
results from published literature, we set it to 0.8) to each one. The resulting
probabilities for each nanopublication that claims the link are then combined
using Stouffer’s Z-Method. The resulting graph is customizable to the domain;
provides on-demand probabilities for each link; and is explorable, in that, for
every known node in the graph, we are able to find related nodes. This
neighborhood graph, as shown in Figure 3, is on the default page of every node, and
has an interface that lets users search and explore the wider knowledge graph
by expanding nodes, filtering on probability, and searching for nodes by name.
        </p>
        <p>
          Some existing frameworks support some of these capabilities. Stardog3 includes
OWL reasoning, mapping of data silos into RDF, and custom rules. OntoWiki provides
a user interface on top of an RDF database that tracks history, allows users to
browse and edit knowledge, and supports user interface extensions.4 Callimachus
lets developers create UIs by object type using RDFa [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Virtuoso OpenLink Data
Spaces is a linked data publishing tool that provides a set of pre-defined data
import tools and a fixed set of views on the linked data it creates.5 Vitro6
supports the creation of new ontology classes and instances, but does not allow
users to create custom interfaces.
        </p>
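The combination step of the meta-analysis can be sketched in a few lines using only the standard library: each per-nanopublication probability is converted to a z-score, the z-scores are summed and normalized, and the result is mapped back to a probability. The input probabilities below are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def stouffer_combine(probs):
    """Combine independent per-nanopublication link probabilities into
    one overall link probability with Stouffer's Z-method [8]."""
    nd = NormalDist()  # standard normal distribution
    z = sum(nd.inv_cdf(p) for p in probs) / sqrt(len(probs))
    return nd.cdf(z)

# Three nanopublications each claim the link at the BioKG base rate of 0.8;
# the combined evidence yields a higher overall probability (about 0.93).
combined = stouffer_combine([0.8, 0.8, 0.8])
```

Note that agreement among independent sources raises the combined probability above the base rate, while a single claim leaves it unchanged, which is what makes the base-rate default workable for literature-derived assertions.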
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Building BioKG using Whyis</title>
      <p>
        We demonstrate Whyis by building a biology knowledge graph, BioKG, at http://
bit.ly/whyis-demo. BioKG uses the built-in knowledge exploration and
semantic meta-analysis capabilities of Whyis to support probabilistic knowledge
3 A case study: https://www.stardog.com/blog/nasas-knowledge-graph/
4 http://ontowiki.net
5 https://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/Ods
6 Available: https://github.com/vivo-project/Vitro
graph exploration. Whyis supports a large number of knowledge graph use cases,
which guided the development of BioKG in the areas of knowledge curation,
interaction, and inference; these are detailed on the Whyis website.7 The
code used to configure BioKG is available on GitHub at https://github.com/
tetherless-world/biokg. BioKG provides on-demand mappings to entities in
OBO Foundry ontologies8, Uniprot, DOI [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and DBpedia, and has SETL scripts for
loading DrugBank. Users can explore the graph through search and the list of
recently updated entities on the default landing page.
      </p>
      <p>
        The BioKG knowledge graph was written in 3 active development days
using 614 lines of code, including templates, configuration, and data curation
scripts. The conversion of DrugBank to RDF is especially notable. In the original
drug repurposing project, we used Extensible Stylesheet Language Transformations
(XSLT) to generate RDF. Many conversion errors were introduced through the
need to generate RDF as plain text, including occasional spaces in URIs and
other invalid syntax. The SETL script, on the other hand, was able to validate
the RDF as it was generated, and the resulting template is much more legible.
We continue to develop the content, user interface, and inference capabilities of
Whyis and BioKG to support research beyond the probabilistic graph exploration laid
out in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We are developing new methods for display in the knowledge explorer
user interface. We are working on a suite of integrated inference agents. Inductive
7 http://tetherless-world.github.io/whyis/usecases
8 http://obofoundry.org
agents will use statistics and machine learning algorithms to infer new knowledge
and relationships in the knowledge graph. For example, we plan to investigate
graph learning algorithms like [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that can learn from the structure of published
knowledge graphs to find new relations.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We introduced Whyis as the first provenance-aware open source framework for
knowledge graph development that supports knowledge graph curation,
interaction, and inference within a unified semantic analysis ecosystem. We discussed
the importance of nanopublications in the architecture of Whyis and why it is
valuable to develop knowledge graphs that are built on nanopublications. We
showed how we used Whyis to create the biological knowledge graph BioKG, and
how we use semantic meta-analysis to support its use as a probabilistic
knowledge graph. Whyis is published under the Apache 2.0 License on GitHub9, with
additional documentation on how to develop custom knowledge graphs.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work was funded by NIEHS Award 0255-0236-4609 / 1U2CES026555-01,
NSF Award OAC-1640840, the IBM Research AI Horizons Network, and by the Gates
Foundation through HBGDki.</p>
      <p>9 https://tetherless-world.github.io/whyis</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Battle</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wood</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leigh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruth</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>The Callimachus project: RDFa as a web template language</article-title>
          .
          <source>In: Proceedings of the Third International Conference on Consuming Linked Data-Volume</source>
          <volume>905</volume>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          . CEUR-WS.org (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Groth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velterop</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The anatomy of a nanopublication</article-title>
          .
          <source>Information Services and Use</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <fpage>51</fpage>
          -
          <lpage>56</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Learning entity and relation embeddings for knowledge graph completion</article-title>
          .
          <source>In: AAAI</source>
          . vol.
          <volume>15</volume>
          , pp.
          <fpage>2181</fpage>
          -
          <lpage>2187</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>McCusker</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dordick</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          :
          <article-title>Finding melanoma drugs through a probabilistic knowledge graph</article-title>
          .
          <source>PeerJ Computer Science</source>
          <volume>3</volume>
          ,
          <issue>e106</issue>
          (
          <year>Feb 2017</year>
          ), https://doi.org/10.7717/peerj-cs.106
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bennett</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Integrating semantics and numerics: Case study on enhancing genomic and disease data using linked data technologies</article-title>
          .
          <source>Proceedings of SmartData</source>
          pp.
          <fpage>18</fpage>
          -
          <lpage>20</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mons</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velterop</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Nano-publication in the e-science era</article-title>
          .
          <source>In: Workshop on Semantic Web Applications in Scientific Discourse (SWASD</source>
          <year>2009</year>
          ). pp.
          <fpage>14</fpage>
          -
          <lpage>15</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Paskin</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Digital Object Identifier (DOI®) system</article-title>
          .
          <source>Encyclopedia of library and information sciences 3</source>
          ,
          <fpage>1586</fpage>
          -
          <lpage>1592</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Whitlock</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach</article-title>
          .
          <source>Journal of Evolutionary Biology</source>
          <volume>18</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1368</fpage>
          -
          <lpage>1373</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>