K-CAP’17 SciKnow, December 2017, Austin, TX, USA Garijo et al
The DISK Hypothesis Ontology:
Capturing Hypothesis Evolution for Automated Discovery
Daniel Garijo, Yolanda Gil and Varun Ratnakar
Information Sciences Institute, University of Southern California, Marina del Rey, CA, U.S.A
{dgarijo, gil, varunr}@isi.edu
ABSTRACT
Automated discovery systems can formulate and revise hypotheses by gathering and analyzing data. In order to generate new hypotheses and provide explanations of their new findings, these systems need a language to represent hypotheses, their revisions, and their provenance. This paper describes the DISK hypothesis ontology which fulfills these requirements. The paper then presents a survey of existing models for representing hypotheses along with their features and tradeoffs. We compare these hypothesis models in the context of automated discovery and hypothesis evolution.

CCS CONCEPTS
• Information systems → Artificial intelligence; Knowledge representation and reasoning

KEYWORDS
Hypothesis representation, hypothesis evolution, nanopublications, micropublications, automated discovery, ontologies.

1 INTRODUCTION
Formal representations of scientific hypotheses would be useful in many contexts. For instance, in order to keep up with the latest updates on a research area, scientists need to quickly understand the contributions of an article and how it was derived from others. However, the vast amount of new scientific publications makes this task increasingly complex. If scientists represented hypotheses formally in publications, related literature could be easily searched for hypotheses of interest. Alternatively, machine reading systems could also extract hypotheses from text in articles, and generate these formal representations.
Formal representations of hypotheses may also be used to improve reproducibility. Community initiatives on reproducibility promote registering hypotheses and methods before conducting the research [Munafo et al 2017]. Hypotheses are stated in textual form, which can express arbitrarily complex statements about hypotheses. However, text can be imprecise and ambiguous. Creating machine readable representations of research hypotheses would facilitate the organization and management of the literature. To date there is not a standard way of capturing the contents and context of a hypothesis to understand its evolution.
Another important use of formal hypothesis representations is to enable automated discovery systems to do hypothesis testing and revision. Autonomous discovery systems generate hypotheses autonomously based on analysis of relevant data [Pankratius et al 2016; King 2017; Gil et al 2017].
In this paper, we focus on hypothesis representations to capture hypothesis evolution in automated discovery systems. We discuss the requirements that we have found throughout work on the DISK discovery system [Gil et al 2017]. We propose an ontology for hypothesis representation, and compare it to existing models for representing hypotheses.
The rest of the paper is organized as follows. Section 2 describes the DISK automated discovery system, and introduces its hypothesis ontology. Section 3 introduces an evaluation framework for existing models and overviews them. Section 4 discusses the different alternatives for hypothesis representation, and Section 5 concludes the paper.

2 REPRESENTING HYPOTHESES IN THE DISK AUTOMATED DISCOVERY SYSTEM
Our goal is to allow automated discovery systems to test hypotheses provided by users, and revise them based on the results of running computational experiments autonomously.
In prior work, we introduced an approach that captures scientists’ strategies for pursuing hypotheses as lines of inquiry that specify the data to be retrieved, the experimental workflows to run, and how to combine the results to generate a revised confidence level and in some cases a revised hypothesis [Gil et al 2016]. This approach was implemented in the DISK framework (Automated DIscovery of Scientific Knowledge) and demonstrated for cancer multi-omics [Gil et al 2017]. DISK is given a hypothesis statement, such as whether a protein is associated with a type of cancer, and returns either a confidence level on that hypothesis or a revised hypothesis that refers to a mutation of the protein or a more specific type of cancer. As new data becomes available, DISK re-runs the analysis and continuously revises the original hypothesis. DISK tracks the provenance of revised hypotheses in terms of the original hypotheses and the data analyses that were carried out.

K-CAP2017 Workshops and Tutorials Proceedings, © Copyright held by the owner/author(s).
Figure 1. Representing hypotheses in the DISK automated discovery system using the DISK hypothesis ontology. The initial hypothesis statement HS1 is provided by the user. It is then tested through data analysis, which provides evidence HE2 for the hypothesis, a new hypothesis statement HS2, and a qualification HQ2 with a confidence level L1. The revised hypothesis HG2 is a revision of HG1, indicated by a link.
Figure 2. Representing hypothesis evolution in the DISK automated discovery system using the DISK hypothesis ontology. In this
example, additional data of two different types becomes available, causing the system to trigger two separate analyses whose results
are hard to combine. A revised hypothesis statement HS3 is added with a new confidence level L2 (included as part of HQ3) backed
by one of the analyses as evidence HE3. The other analysis HE4 qualifies HS3 with HQ4.
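The structure drawn in Figures 1 and 2 (a statement, one or more qualifiers each backed by evidence, and a link to the hypothesis that was revised) can be sketched in a few lines of plain Python. This is only an illustration and not the DISK implementation or its OWL ontology: the class, field, and function names (`Hypothesis`, `Qualifier`, `revision_of`, `history`) are assumptions made for this sketch, and the numeric confidence value is a placeholder, since the figures only name the level L1.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Qualifier:
    """A confidence level together with the evidence that backs it (hypothetical names)."""
    confidence_level: float
    evidence: str  # identifier of the analysis that produced the level


@dataclass
class Hypothesis:
    """One hypothesis graph: a statement, its qualifiers, and a link to its predecessor."""
    name: str
    statement: Triple
    qualifiers: List[Qualifier] = field(default_factory=list)
    revision_of: Optional["Hypothesis"] = None


def history(hypothesis: Hypothesis) -> List[str]:
    """Follow the revision links back to the hypothesis the user provided."""
    chain: List[str] = []
    current: Optional[Hypothesis] = hypothesis
    while current is not None:
        chain.append(current.name)
        current = current.revision_of
    return chain


# HG1 is the user-provided hypothesis; HG2 is its revision (as in Figure 1).
hg1 = Hypothesis("HG1", ("EGFR", "associatedWith", "ColonCancer"))
hg2 = Hypothesis(
    "HG2",
    ("EGFR", "associatedWith", "ColonCancerSubtypeA"),
    qualifiers=[Qualifier(confidence_level=0.6, evidence="HE2")],  # 0.6 is a placeholder for L1
    revision_of=hg1,
)

print(history(hg2))
```

Keeping `qualifiers` as a list rather than a single value mirrors Figure 2, where one statement carries two qualifiers whose confidence levels cannot easily be merged.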
DISK uses a representation of hypotheses that is needed to track their evolution. In DISK, a hypothesis consists of:
1. A hypothesis statement, which is a set of structured assertions about entities in the domain. For example, that the protein EGFR is associated with colon cancer.
2. A hypothesis qualifier, which represents the veracity of the hypothesis based on the data and the analyses done so far. A typical qualifier is a numeric confidence level. For example, for the hypothesis statement above we could have a confidence level given by a p-value of 0.07.
3. Hypothesis evidence, which is a record of the analyses that were carried out to test a hypothesis statement. For example, the evidence of a given hypothesis may include an analysis of mass spectrometry data for 25 patients with colon cancer and 25 healthy controls followed by clustering, cluster metrics and binary hypothesis testing.
4. A hypothesis history, which points to prior hypotheses that were revised to generate the current one. In our example, a hypothesis such as the association of protein EGFR with colon cancer SubType A would link back to the original hypothesis statement that protein EGFR is associated with colon cancer.
DISK represents hypothesis statements as a graph, where the nodes are the entities in the hypotheses and the links are their relationships. In our work, a hypothesis statement is represented in RDF as a simple triple, and the triple is linked to its qualifier, evidence, and history. All those assertions are also made in RDF. The hypothesis evidence and hypothesis history both represent different aspects of provenance for the hypothesis. This is captured using the PROV provenance standard [Lebo et al 2013].
Figure 1 illustrates this representation using the running example with protein EGFR. The original hypothesis HG1 had its own statement HS1 and evidence HE1. The revised hypothesis HG2 includes its statement HS2, its confidence level L1 (part of the qualifier HQ2), its evidence HE2, and a link to the original hypothesis HG1. A feature of this representation is the ability to model different confidence levels associated with a hypothesis statement. This often happens when evidence is obtained from analyzing different types of data and it is unclear how to combine the resulting confidence levels. Figure 2 shows an example. HS3 is qualified with two confidence reports (C2 and C3), which have different supporting evidence (HE3 and HE4), each resulting from a different data source.
The DISK hypothesis ontology is available in OWL and documented in [Garijo et al 2017]. A major focus of the DISK hypothesis ontology is capturing hypothesis evolution. The rest of this paper focuses on comparing this ontology to other representations of scientific hypotheses in the literature.

3 A SURVEY OF HYPOTHESIS REPRESENTATIONS
In this section we present a survey of existing models of scientific hypotheses and assess their features to support automated discovery.

3.1 Comparing hypothesis models
In our analysis, we consider the following key aspects, based on the representation presented in Section 2:
1. Statement: Does the model have a representation for statements in a hypothesis?
2. Qualifier: Does the model have a means to qualify a hypothesis with a confidence level?
3. Evidence: Does the model describe the supporting evidence for a hypothesis?
4. History: Does the model represent the relationship between hypothesis revisions?
In addition, the following aspects are desirable for flexibility and extensibility:
5. Classification: Does the vocabulary support a taxonomy of hypothesis statements?
6. Standards: Is the model defined using standards or does it use proprietary or idiosyncratic formats?

3.2 Models for representing hypotheses
This section introduces different approaches to represent hypotheses. We group them according to the level of detail at which they describe hypotheses: coarse-grained and fine-grained representations.

3.2.1 Coarse-grained hypothesis models
We group under this section those vocabularies that include main concepts to identify hypotheses, but do not include the means to qualify them or describe them at a statement level. For example, popular vocabularies like the Semantic Web for Earth and Environmental Terminology Ontology1 (SWEET) [Raskin and Pan 2005] contain modules for defining hypotheses as “Experimental Activities”. Likewise, the Ontology for Biomedical Investigations (OBI)2 [Bandrowski et al 2016] and the Ontology for Clinical Research (OCRe)3 [Sim et al 2014] have concepts to refer to a hypothesis in the context of a biological experiment.
Other vocabularies include terms to further describe hypotheses. The EXPO Ontology aims to define a model for representing scientific experiments, "including generic knowledge about scientific experimental design, methodology and results representation" [Soldatova and King 2006]. The EXPO Ontology extends common upper level ontologies in order to bridge the gap between domain specific experiment formalization and upper level ontologies. EXPO aims at describing scientific papers, and has a specific part designed for the description of hypotheses. The focus of EXPO is on how the hypothesis is defined in a research paper (the "part of" relationship between the scientific experiment and the hypothesis), rather than identifying the statements contained by the hypothesis itself. However, different classes of hypothesis are identified in the ontology (i.e., null hypothesis, research hypothesis and scientific hypothesis).
Finally, the Linked Science Vocabulary4 proposes a lightweight model to express the support that some research gives to a hypothesis. A hypothesis is represented to make predictions about facts, but it is not described at a statement level.

3.2.2 Fine-grained hypothesis models
We group in this section those approaches that provide the means to represent in detail the statements belonging to a hypothesis, along with their metadata.
LABORS [Soldatova and Rzhetsky 2011] is designed to support investigations run by an automated system for the area of Systems Biology and Functional Genomics. LABORS uses EXPO as an upper level ontology, and splits the representation of hypotheses into textual and logical representations, using concepts from OBI and other upper level ontologies. It also allows aggregating hypotheses with multiple statements in hypothesis sets, using a Datalog representation for each hypothesis statement.
The nanopublication model5 [Groth et al 2010] aims to represent “the smallest unit of publishable information”, i.e., every assertion that is part of a hypothesis graph. Nanopublications are composed of three main graphs: an assertion graph containing the assertion or multiple assertions which are part of the nanopublication; a provenance graph with the statements that describe the provenance of the assertion graph (e.g., the assertion graph came from a publication, a scientific experiment, etc.); and lastly a publication info graph which contains the metadata about the nanopublication itself (e.g., who created it). Each of the graphs is represented using a named graph,6 so as to be able to describe it properly with metadata from any of the other graphs. An example can be seen in the snippet below, where a hypothesis H1 as in Figure 1 is represented with its provenance (sub:provenance), assertion (sub:hypothesisAssertion) and publication information (sub:pubInfo):

    @prefix sub: .
    @prefix np: .
    @prefix prov: .
    @prefix xsd: .
    @prefix ex:
    sub:defaultGraph {
      sub:n1 np:hasAssertion sub:hypothesisAssertion ;
        np:hasProvenance sub:provenance ;
        np:hasPublicationInfo sub:pubInfo ;
        a np:Nanopublication, ex:Hypothesis .
    }
    sub:hypothesisAssertion { ## statements contained in the hypothesis graph
      ex:EGFR ex:associatedWith ex:ColonCancer .
    }
    sub:provenance { ## provenance of the assertion graph
      sub:hypothesisAssertion prov:generatedAtTime "2012-02-03T14:38:00Z"^^xsd:dateTime ;
        ex:hasConfidenceReport ex:conf1 ;
        prov:wasAttributedTo ex:experimentScientist .
      ex:conf1 a ex:ConfidenceReport ;
        ex:hasConfidenceLevel "0.6" ;
        prov:wasGeneratedBy ex:execution1 .
    }
    sub:pubInfo { ## publication information of the user who performed the hypothesis
      sub:n1 prov:generatedAtTime "2016-03-26T12:45:00Z"^^xsd:dateTime ;
        prov:wasAttributedTo ex:user1 .
    }

The ovopublication model proposes a simple approach designed to capture the provenance of assertions [Callahan and Dumontier 2013]. When contrasted with nanopublications, "the ovopub is simpler as it consists of only a single named graph with key provenance information directly contained in and associated with the ovopub graph" [Callahan and Dumontier 2013]. Ovopublications mix the notion of named graphs with reification to refer to the different components and relationships of the ovopublication itself. The Ovopub model is integrated as part of the Semanticscience Integrated Ontology (SIO)7, which also provides the means to describe hypotheses as literals.
The Semantic Web Applications in Neuromedicine (SWAN) ontology8 [Ciccarese et al 2008] aims to represent the scientific discourse of bio-medicine papers in general and neuro-medicine papers in particular. The model is composed of several modules for representing discourse elements and their relationships, different types of agents, the roles, provenance and versioning of a given statement and bibliographic references. SWAN was designed to describe statements in papers (along with the evidence supporting them). If we consider a hypothesis as a text statement, the following example illustrates the SWAN model:

    @prefix swande: .
    @prefix swanco: .
    @prefix swanqs: .
    @prefix swandr: .
    @prefix swanpav: .
    @prefix swanci: .

    ex:hypothesis a swande:ResearchStatement ;
      swande:title "EGFR is associated with colon cancer subtype A"@en ;
      swanco:researchStatementQualifiedAs ;
      swanci:derivedFrom ex:execution1 ;
      ex:hasConfidenceReport ex:c1 ;
      swanpav:authoredBy ex:experimentScientist ;
      swanpav:createdOn "2012-02-03T14:38:00Z"^^xsd:dateTime .

In the example, a hypothesis is extracted from a research article. The hypothesis is represented as a statement, which can be further described with SWAN. The provenance of the hypothesis is captured as well by recording the agents who created the hypothesis statement.

1 http://sweet.jpl.nasa.gov/2.3/reprSciModel.owl
2 http://purl.obolibrary.org/obo/OBI_0001908
3 http://purl.org/net/OCRe/OCRe.owl#OCRE400032
4 http://linkedscience.org/lsc/ns/
5 http://www.nanopub.org/nschema#
6 https://www.w3.org/TR/rdf11-concepts/
7 http://semanticscience.org/ontology/sio.owl
8 https://www.w3.org/TR/hcls-swan/
Figure 3: The example from Figure 1 adapted to the micropublication model, following [Clark et al 2014]. The namespaces indicate
the ontology used: mp for micropublications, prov for the PROV ontology, and ext for the extension that would need to be added.
Finally, micropublications9 [Clark et al 2014] are derived from the SWAN model and can be considered a refinement of the nanopublication model. Micropublications propose a semantic model of scientific argumentation and evidence that supports natural language statements, data and materials specifications, discussion, etc. Figure 3 shows an illustrative example, where a micropublication uses a mechanism similar to an assertion graph to represent the claim of a protein being associated with a subtype of colon cancer, along with its supporting evidence. The micropublication model uses the Web Annotation Ontology10 to associate a micropublication and its contents with text from articles.

4 DISCUSSION
Table 1 summarizes the different candidate models for hypothesis representation in automated discovery systems, according to the features described in Section 3.1. Most models lack support for qualifying a given hypothesis with confidence levels. In order to overcome this issue, we may follow an approach similar to Figure 1: extend the target model with a class (ConfidenceReport) and two properties (hasConfidenceReport and hasConfidenceLevel) linking them together. A reason why the confidence level may not be directly linked to a hypothesis is that the same hypothesis may be evaluated at different points in time, resulting in multiple confidence levels, each with different provenance information and each included in a separate confidence report.
The upper half of Table 1 corresponds to the models for coarse-grained hypothesis representation. These models include a main concept to refer to a hypothesis, but lack the means to describe hypothesis statements. Therefore, they do not meet the majority of the requirements that DISK has for representing hypothesis statements, qualifiers, history and evidence. However, the LinkedScience, OBI and EXPO vocabularies define different types of hypotheses, and may be potential candidates for reuse if we need to define a hypothesis taxonomy.
The lower half of Table 1 corresponds to fine-grained models to describe hypotheses, either defining classes and properties to qualify hypothesis statements with provenance metadata or relating its different parts together. Among these, the nanopublication and micropublication models are the most flexible approaches, compliant with most of the requirements of the DISK model (in the last row). LABORS uses a Datalog representation for describing hypothesis statements and is domain specific. The ovopublications model is a simplification of the nanopublication model to include provenance of assertions or collections of assertions. Although it could be used for hypothesis representation, we consider that the model would need to be thoroughly extended. Similarly, the SWAN model is extended in the micropublication approach to represent argumentation of facts in publications. Therefore, the nanopublication and micropublication models provide a richer initial framework.
A major difference between micropublications and nanopublications is the scope of the domain. For instance, micropublications were explicitly designed to model facts and argumentation of text statements. If an automated discovery system aims to represent single assertions of hypotheses and their evolution, then an argumentation framework such as the one proposed in the micropublication model is not necessary. In contrast, if the provenance trace includes all evidence to support a particular claim made in a hypothesis, then micropublications are an appropriate model to use.
Another aspect to consider is the support from the communities that are using these models. The nanopublication model has been discussed for some time, and has available tooling, documentation and examples.11 The micropublication model has been documented in detail with examples [Clark et al 2014], but it has not yet reached the level of adoption and tooling that nanopublications have.

9 http://purl.org/mp
10 https://www.w3.org/ns/oa
11 http://nanopub.org/
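The confidence-report extension proposed in Section 4 (one class and two properties, with a separate report for each evaluation of the same hypothesis) can be sketched as plain subject-predicate-object tuples in Python. This is a rough illustration under stated assumptions, not part of any of the surveyed models: the helper names `confidence_report_triples` and `levels_for` are hypothetical, the `ex:` terms follow the names used in the paper's snippets, and the second report (its level, execution, and timestamp) is made up to show a later re-evaluation.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def confidence_report_triples(statement_id: str, report_id: str, level: float,
                              generated_by: str, generated_at: str) -> List[Triple]:
    """Build the triples that attach one confidence report to a hypothesis statement."""
    return [
        (statement_id, "ex:hasConfidenceReport", report_id),
        (report_id, "rdf:type", "ex:ConfidenceReport"),
        (report_id, "ex:hasConfidenceLevel", str(level)),
        # Provenance lives on the report, so each evaluation keeps its own trace.
        (report_id, "prov:wasGeneratedBy", generated_by),
        (report_id, "prov:generatedAtTime", generated_at),
    ]


def levels_for(triples: List[Triple], statement_id: str) -> List[str]:
    """Return every confidence level reported for a statement, in insertion order."""
    reports = [o for s, p, o in triples
               if s == statement_id and p == "ex:hasConfidenceReport"]
    return [o for s, p, o in triples
            if s in reports and p == "ex:hasConfidenceLevel"]


triples: List[Triple] = []
# First evaluation (values taken from the nanopublication example in Section 3.2.2).
triples += confidence_report_triples(
    "sub:hypothesisAssertion", "ex:conf1", 0.6,
    "ex:execution1", "2012-02-03T14:38:00Z")
# Later re-evaluation of the same statement (illustrative values only).
triples += confidence_report_triples(
    "sub:hypothesisAssertion", "ex:conf2", 0.8,
    "ex:execution2", "2013-01-01T00:00:00Z")

print(levels_for(triples, "sub:hypothesisAssertion"))
```

Because the level hangs off the report rather than the statement, adding a new evaluation never overwrites an earlier one.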
Table 1: Overview of models for hypothesis representation.
Hypothesis model | Hypothesis statement | Hypothesis qualifier | Hypothesis evidence | Hypothesis history | Hypothesis classification | Use of standards
SWEET [Raskin and Pan 2005] | No | No | No | No | No | Yes (OWL)
OBI [Bandrowski et al 2016] | No | No | No | No | Yes | Yes (OWL)
EXPO [Soldatova and King 2006] | No | No | No | No | Yes | Yes (OWL)
OCRe [Sim et al 2014] | No | No | No | No | No | Yes (OWL)
Linked Science Vocabulary | No | No | Partly | No | No | Yes (OWL)
LABORS [Soldatova and Rzhetsky 2011] | No | No | Yes | No | Yes | Yes (OWL)
Nanopublications [Groth et al 2010] | Text/structured | No | Yes | Yes | No | Yes (OWL), named graphs
Ovopublications [Callahan and Dumontier 2013] | Text/structured | No | No | Yes | No | Yes (OWL), named graphs
SWAN [Ciccarese et al 2008] | Text | No | Yes | Yes | No | Yes (OWL)
Micropublications [Clark et al 2014] | Text | Yes | Yes | No | No | Yes (OWL), named graphs
DISK [Garijo et al 2017] | Structured | Yes | Yes | Yes | No | Yes (OWL), named graphs
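Table 1 can also be read as data. The short Python sketch below encodes the first five feature columns and lists, for a given model, which of the four key aspects from Section 3.1 it does not fully cover. The function name `missing_key_aspects` is hypothetical, and treating "Partly" as not covered is an assumption of this sketch rather than a judgment made in the paper.

```python
from typing import Dict, List, Tuple

COLUMNS = ("statement", "qualifier", "evidence", "history", "classification")
KEY_ASPECTS = ("statement", "qualifier", "evidence", "history")

# Cell values copied from Table 1, in COLUMNS order.
TABLE_1: Dict[str, Tuple[str, ...]] = {
    "SWEET": ("No", "No", "No", "No", "No"),
    "OBI": ("No", "No", "No", "No", "Yes"),
    "EXPO": ("No", "No", "No", "No", "Yes"),
    "OCRe": ("No", "No", "No", "No", "No"),
    "Linked Science Vocabulary": ("No", "No", "Partly", "No", "No"),
    "LABORS": ("No", "No", "Yes", "No", "Yes"),
    "Nanopublications": ("Text/structured", "No", "Yes", "Yes", "No"),
    "Ovopublications": ("Text/structured", "No", "No", "Yes", "No"),
    "SWAN": ("Text", "No", "Yes", "Yes", "No"),
    "Micropublications": ("Text", "Yes", "Yes", "No", "No"),
    "DISK": ("Structured", "Yes", "Yes", "Yes", "No"),
}


def missing_key_aspects(model: str) -> List[str]:
    """List the key aspects (Section 3.1, items 1-4) a model does not fully cover."""
    row = dict(zip(COLUMNS, TABLE_1[model]))
    # Assumption of this sketch: "Partly" counts as not covered.
    return [aspect for aspect in KEY_ASPECTS if row[aspect] in ("No", "Partly")]


for name in ("Nanopublications", "Micropublications", "DISK"):
    print(name, missing_key_aspects(name))
```

Read this way, nanopublications lack only a qualifier and micropublications lack only a history, which matches the discussion of these two models as the closest starting points.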
Finally, both the nanopublication and micropublication models present an important limitation for representing hypotheses: they have been designed to describe simple facts, i.e., single statements or a single collection of statements as part of their claim. In the nanopublication model this is reflected by having a unique assertion graph per nanopublication, containing one or more statements. If we wanted to describe a hypothesis composed of multiple statements, each with confidence levels assigned independently by different experiments, we would have to extend the nanopublication model. A possibility may be creating a new class (a hypothesis composition concept such as the “hypotheses-set” in LABORS) that aggregates each of its statements as an individual nanopublication. Likewise, each micropublication contains a main claim graph and its support. A mechanism for extending and aggregating micropublications would also be needed to represent hypotheses with multiple statements. Note that the extension would only be necessary in both models if we wanted to keep the provenance for each statement of the hypothesis. Otherwise the statements can be included in the assertion graph in the case of nanopublications or the claim graph in the case of micropublications.

5 CONCLUSIONS AND FUTURE WORK
In this paper we introduced the DISK hypothesis ontology for representing hypothesis evolution, which was developed for the DISK automated discovery system. We also presented a survey of existing vocabularies to represent hypotheses, and assessed their suitability in the context of automated knowledge discovery. Future work includes extending the DISK ontology to align with these models.

ACKNOWLEDGMENTS
We gratefully acknowledge support from the Defense Advanced Research Projects Agency through the SIMPLEX program with award W911NF-15-1-0555, and from the National Institutes of Health under award 1R01GM117097. We also thank our collaborators in the DISK project, especially Parag Mallick, Ravali Adusumilli, and Hunter Boyce for their useful feedback on this work.

REFERENCES
[Callahan and Dumontier 2013] Alison Callahan and Michel Dumontier. Ovopub: Modular data publication with minimal provenance. arXiv preprint arXiv:1305.6800, 2013.
[Bandrowski et al 2016] Bandrowski A, Brinkman R, Brochhausen M, Brush MH, Bug B, et al. (2016) The Ontology for Biomedical Investigations. PLOS ONE 11(4): e0154556. https://doi.org/10.1371/journal.pone.0154556
[Clark et al 2014] Tim Clark, Paolo N. Ciccarese and Carole A. Goble. Micropublications: a semantic model for claims, evidence, arguments and annotations in biomedical communications. Journal of Biomedical Semantics 2014, 5:28.
[Ciccarese et al 2008] Ciccarese P, Wu E, Kinoshita J, et al. The SWAN Scientific Discourse Ontology. Journal of biomedical informatics. 2008;41(5):739-751. doi:10.1016/j.jbi.2008.04.010.
[Garijo et al 2017] The DISK Hypothesis Ontology. Version 1.0.0. Available from http://disk-project.org/ontology/disk#
[Gil et al 2016] Gil, Y.; Garijo, D.; Ratnakar, V.; Mayani, R.; Adusumilli, R.; and Boyce, H. Automated Hypothesis Testing with Large Scientific Data Repositories. In Proceedings of the Fourth Annual Conference on Advances in Cognitive Systems (ACS), pages 1-6, 2016.
[Gil et al 2017] Gil, Y.; Garijo, D.; Ratnakar, V.; Mayani, R.;
Adusumilli, R.; Boyce, H.; Srivastava, A.; and Mallick, P.
Towards Continuous Scientific Data Analysis and Hypothesis
Evolution. In Proceedings of the Thirty-First AAAI
Conference on Artificial Intelligence (AAAI-17), 2017.
[Groth et al 2010] Groth, Paul; Gibson, Andrew; Velterop, Jan.
The anatomy of a nanopublication. Information Services and
Use, 30, 1-2: 52-56, 2010.
[King 2017] Ross King. The Adam and Eve Robot Scientists for
the Automated Discovery of Scientific Knowledge. Bulletin of
the American Physical Society, 2017
[Lebo et al 2013] Lebo, T., McGuinness, D., Belhajjame, K.,
Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik,
S., and Zhao, J. (2013). The PROV ontology, W3C
recommendation. Technical report, World Wide Web
Consortium (W3C), 30th April 2013.
[Munafo et al 2017] Marcus R. Munafò, Brian A. Nosek, Dorothy
V. M. Bishop, Katherine S. Button, Christopher D. Chambers,
Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan
Wagenmakers, Jennifer J. Ware & John P. A. Ioannidis. A
manifesto for reproducible science. Nature Human Behaviour
1, Article number: 0021 (2017). doi:10.1038/s41562-016-0021
[Pankratius et al 2016] V. Pankratius, J. Li, M. Gowanlock, D.
Blair, C. Rude, T. Herring, F. Lind, P. Erickson, C. Lonsdale,
Computer-Aided Discovery: Towards Scientific Insight
Generation with Machine Support. IEEE Intelligent Systems
31(4), pp. 3-10, Jul/Aug 2016.
[Raskin and Pan 2005] Robert G. Raskin and Michael J. Pan.
Knowledge representation in the semantic web for Earth and
environmental terminology (SWEET). Computers &
Geosciences 31(9):1119-1125, November 2005.
doi:10.1016/j.cageo.2004.12.004.
[Sim et al 2014] Sim I, Tu SW, Carini S, et al. The Ontology of
Clinical Research (OCRe): An Informatics Foundation for the
Science of Clinical Research. Journal of biomedical
informatics. 2014;52:78-91. doi:10.1016/j.jbi.2013.11.002.
[Soldatova and King 2006] Soldatova, LN & King, RD.
An Ontology of Scientific Experiments. Journal of the Royal
Society Interface, 3(11):795-803, 2006.
doi:10.1098/rsif.2006.0134.
[Soldatova and Rzhetsky 2011] Soldatova, LN and Rzhetsky, A.
Representation of research hypotheses. Journal of Biomedical
Semantics 2(Suppl 2):S9, 2011.
https://doi.org/10.1186/2041-1480-2-S2-S9