K-CAP’17 SciKnow, December 2017, Austin, TX, USA Garijo et al

The DISK Hypothesis Ontology: Capturing Hypothesis Evolution for Automated Discovery

Daniel Garijo, Yolanda Gil and Varun Ratnakar
Information Sciences Institute, University of Southern California, Marina del Rey, CA, U.S.A.
{dgarijo, gil, varunr}@isi.edu

K-CAP2017 Workshops and Tutorials Proceedings, © Copyright held by the owner/author(s).

ABSTRACT

Automated discovery systems can formulate and revise hypotheses by gathering and analyzing data. In order to generate new hypotheses and provide explanations of their new findings, these systems need a language to represent hypotheses, their revisions, and their provenance. This paper describes the DISK hypothesis ontology, which fulfills these requirements. The paper then presents a survey of existing models for representing hypotheses along with their features and tradeoffs. We compare these hypothesis models in the context of automated discovery and hypothesis evolution.

CCS CONCEPTS

• Information systems → Artificial intelligence; Knowledge representation and reasoning

KEYWORDS

Hypothesis representation, hypothesis evolution, nanopublications, micropublications, automated discovery, ontologies.

1 INTRODUCTION

Formal representations of scientific hypotheses would be useful in many contexts. For instance, in order to keep up with the latest updates on a research area, scientists need to quickly understand the contributions of an article and how it was derived from others. However, the vast amount of new scientific publications makes this task increasingly complex. If scientists represented hypotheses formally in publications, related literature could be easily searched for hypotheses of interest. Alternatively, machine reading systems could also extract hypotheses from text in articles, and generate these formal representations.

Formal representations of hypotheses may also be used to improve reproducibility. Community initiatives on reproducibility promote registering hypotheses and methods before conducting the research [Munafo et al 2017]. Hypotheses are stated in textual form, which can express arbitrarily complex statements about hypotheses. However, text can be imprecise and ambiguous. Creating machine readable representations of research hypotheses would facilitate the organization and management of the literature. To date there is not a standard way of capturing the contents and context of a hypothesis to understand its evolution.

Another important use of formal hypothesis representations is to enable automated discovery systems to do hypothesis testing and revision. Such systems generate hypotheses autonomously based on analysis of relevant data [Pankratius et al 2016; King 2017; Gil et al 2017].

In this paper, we focus on hypothesis representations to capture hypothesis evolution in automated discovery systems. We discuss the requirements that we have found throughout work on the DISK discovery system [Gil et al 2017]. We propose an ontology for hypothesis representation, and compare it to existing models for representing hypotheses.

The rest of the paper is organized as follows. Section 2 describes the DISK automated discovery system, and introduces its hypothesis ontology. Section 3 introduces an evaluation framework for existing models and overviews them. Section 4 discusses the different alternatives for hypothesis representation, and Section 5 concludes the paper.

2 REPRESENTING HYPOTHESES IN THE DISK AUTOMATED DISCOVERY SYSTEM

Our goal is to allow automated discovery systems to test hypotheses provided by users, and revise them based on the results of running computational experiments autonomously.

In prior work, we introduced an approach that captures scientists’ strategies for pursuing hypotheses as lines of inquiry that specify the data to be retrieved, the experimental workflows to run, and how to combine the results to generate a revised confidence level and in some cases a revised hypothesis [Gil et al 2016]. This approach was implemented in the DISK framework (Automated DIscovery of Scientific Knowledge) and demonstrated for cancer multi-omics [Gil et al 2017]. DISK is given a hypothesis statement, such as whether a protein is associated with a type of cancer, and returns either a confidence level on that hypothesis or a revised hypothesis that refers to a mutation of the protein or a more specific type of cancer. As new data becomes available, DISK re-runs the analysis and continuously revises the original hypothesis. DISK tracks the provenance of revised hypotheses in terms of the original hypotheses and the data analyses that were carried out.

DISK uses a representation of hypotheses that is needed to track their evolution. In DISK, a hypothesis consists of:

1. A hypothesis statement, which is a set of structured assertions about entities in the domain. For example, that the protein EGFR is associated with colon cancer.
2. A hypothesis qualifier, which represents the veracity of the hypothesis based on the data and the analyses done so far. A typical qualifier is a numeric confidence level. For example, for the hypothesis statement above we could have a confidence level given by a p-value of 0.07.
3. Hypothesis evidence, which is a record of the analyses that were carried out to test a hypothesis statement. For example, the evidence of a given hypothesis may include an analysis of mass spectrometry data for 25 patients with colon cancer and 25 healthy controls followed by clustering, cluster metrics and binary hypothesis testing.
4. A hypothesis history, which points to prior hypotheses that were revised to generate the current one. In our example, a hypothesis such as the association of protein EGFR with colon cancer SubType A would link back to the original hypothesis statement that protein EGFR is associated with colon cancer.

DISK represents hypothesis statements as a graph, where the nodes are the entities in the hypotheses and the links are their relationships. In our work, a hypothesis statement is represented in RDF as a simple triple, and the triple is linked to its qualifier, evidence, and history. All those assertions are also made in RDF. The hypothesis evidence and hypothesis history both represent different aspects of provenance for the hypothesis. This is captured using the PROV provenance standard [Lebo et al 2013].

Figure 1 illustrates this representation using the running example with protein EGFR. The original hypothesis HG1 had its own statement HS1 and evidence HE1. The revised hypothesis HG2 includes its statement HS2, its confidence level L1 (part of the qualifier HQ2), its evidence HE2, and a link to the original hypothesis HG1. A feature of this representation is the ability to model different confidence levels associated with a hypothesis statement. This often happens when evidence is obtained from analyzing different types of data and it is unclear how to combine the resulting confidence levels. Figure 2 shows an example. HS3 is qualified with two confidence reports (C2 and C3), which have different supporting evidence (HE3 and HE4), each resulting from a different data source.

Figure 1. Representing hypotheses in the DISK automated discovery system using the DISK hypothesis ontology. The initial hypothesis statement HS1 is provided by the user. It is then tested through data analysis, which provides evidence HE2 for the hypothesis, a new hypothesis statement HS2, and a qualification HQ2 with a confidence level L1. The revised hypothesis HG2 is a revision of HG1, indicated by a link.

Figure 2. Representing hypothesis evolution in the DISK automated discovery system using the DISK hypothesis ontology. In this example, additional data of two different types becomes available, causing the system to trigger two separate analyses whose results are hard to combine. A revised hypothesis statement HS3 is added with a new confidence level L2 (included as part of HQ3) backed by one of the analyses as evidence HE3. The other analysis HE4 qualifies HS3 with HQ4.

The DISK hypothesis ontology is available in OWL and documented in [Garijo et al 2017]. A major focus of the DISK hypothesis ontology is capturing hypothesis evolution. The rest of this paper focuses on comparing this ontology to other representations of scientific hypotheses in the literature.

3 A SURVEY OF HYPOTHESIS REPRESENTATIONS

In this section we present a survey of existing models of scientific hypotheses and assess their features to support automated discovery.

3.1 Comparing hypothesis models

In our analysis, we consider the following key aspects, based on the representation presented in Section 2:

1. Statement: Does the model have a representation for statements in a hypothesis?
2. Qualifier: Does the model have a means to qualify a hypothesis with a confidence level?
3. Evidence: Does the model describe the supporting evidence for a hypothesis?
4. History: Does the model represent the relationship between hypothesis revisions?

In addition, the following aspects are desirable for flexibility and extensibility:

5. Classification: Does the vocabulary support a taxonomy of hypothesis statements?
6. Standards: Is the model defined using standards or does it use proprietary or idiosyncratic formats?

3.2 Models for representing hypotheses

This section introduces different approaches to represent hypotheses at different levels of granularity. We group them according to the level of detail at which they describe hypotheses: coarse-grained and fine-grained representations.

3.2.1 Coarse-grained hypothesis models

We group under this section those vocabularies that include main concepts to identify hypotheses, but do not include the means to qualify them or describe them at a statement level. For example, popular vocabularies like the Semantic Web for Earth and Environmental Terminology Ontology¹ (SWEET) [Raskin and Pan 2005] contain modules for defining hypotheses as “Experimental Activities”. Likewise, the Ontology for Biomedical Investigations (OBI)² [Bandrowski et al 2016] and the Ontology of Clinical Research (OCRe)³ [Sim et al 2014] have concepts to refer to a hypothesis in the context of a biological experiment.

Other vocabularies include terms to further describe hypotheses. The EXPO Ontology aims to define a model for representing scientific experiments, "including generic knowledge about scientific experimental design, methodology and results representation" [Soldatova and King 2006]. The EXPO Ontology extends common upper level ontologies in order to bridge the gap between domain specific experiment formalization and upper level ontologies. EXPO aims at describing scientific papers, and has a specific part designed for the description of hypotheses. The focus of EXPO is on how the hypothesis is defined in a research paper (the "part of" relationship between the scientific experiment and the hypothesis), rather than on identifying the statements contained by the hypothesis itself. However, different classes of hypothesis are identified in the ontology (i.e., null hypothesis, research hypothesis and scientific hypothesis).

Finally, the Linked Science Vocabulary⁴ proposes a lightweight model to express that some piece of research supports a hypothesis. A hypothesis is represented to make predictions about facts, but it is not described at a statement level.

¹ http://sweet.jpl.nasa.gov/2.3/reprSciModel.owl
² http://purl.obolibrary.org/obo/OBI_0001908
³ http://purl.org/net/OCRe/OCRe.owl#OCRE400032
⁴ http://linkedscience.org/lsc/ns/

3.2.2 Fine-grained hypothesis models

We group in this section those approaches that provide the means to represent in detail the statements belonging to a hypothesis, along with their metadata.

LABORS [Soldatova and Rzhetsky 2011] is designed to support investigations run by an automated system for the area of Systems Biology and Functional Genomics. LABORS uses EXPO as an upper level ontology, and splits the representation of hypotheses into textual and logical representations, using concepts from OBI and other upper level ontologies. It also allows aggregating hypotheses with multiple statements in hypothesis sets, using a Datalog representation for each hypothesis statement.

The nanopublication model⁵ [Groth et al 2010] aims to represent “the smallest unit of publishable information”, i.e., every assertion that is part of a hypothesis graph. Nanopublications are composed of three main graphs: an assertion graph containing the assertion or multiple assertions which are part of the nanopublication; a provenance graph with the statements that describe the provenance of the assertion graph (e.g., the assertion graph came from a publication, a scientific experiment, etc.); and lastly a publication info graph which contains the metadata about the nanopublication itself (e.g., who created it). Each of the graphs is represented using a named graph,⁶ so as to be able to describe it properly with metadata from any of the other graphs. An example can be seen in the snippet below, where a hypothesis H1 as in Figure 1 is represented with its provenance (sub:provenance), assertion (sub:hypothesisAssertion) and publication information (sub:pubInfo) graphs.

    @prefix sub: .
    @prefix np: .
    @prefix prov: .
    @prefix xsd: .
    @prefix ex: .

    sub:defaultGraph {
        sub:n1 np:hasAssertion sub:hypothesisAssertion ;
            np:hasProvenance sub:provenance ;
            np:hasPublicationInfo sub:pubInfo ;
            a np:Nanopublication, ex:Hypothesis .
    }
    sub:hypothesisAssertion { ## statements contained in the hypothesis graph
        ex:EGFR ex:associatedWith ex:ColonCancer .
    }
    sub:provenance { ## provenance of the assertion graph
        sub:hypothesisAssertion prov:generatedAtTime "2012-02-03T14:38:00Z"^^xsd:dateTime ;
            ex:hasConfidenceReport ex:conf1 ;
            prov:wasAttributedTo ex:experimentScientist .
        ex:conf1 a ex:ConfidenceReport ;
            ex:hasConfidenceLevel "0.6" ;
            prov:wasGeneratedBy ex:execution1 .
    }
    sub:pubInfo { ## publication information of the user who performed the hypothesis
        sub:n1 prov:generatedAtTime "2016-03-26T12:45:00Z"^^xsd:dateTime ;
            prov:wasAttributedTo ex:user1 .
    }

The ovopublication model proposes a simple approach designed to capture the provenance of assertions [Callahan and Dumontier 2013]. When contrasted with nanopublications, "the ovopub is simpler as it consists of only a single named graph with key provenance information directly contained in and associated with the ovopub graph" [Callahan and Dumontier 2013]. Ovopublications mix the notion of named graphs with reification to refer to the different components and relationships of the ovopublication itself. The Ovopub model is integrated as part of the Semanticscience Integrated Ontology (SIO)⁷, which also provides the means to describe hypotheses as literals.

The Semantic Web Applications in Neuromedicine (SWAN) ontology⁸ [Ciccarese et al 2008] aims to represent the scientific discourse of bio-medicine papers in general and neuro-medicine papers in particular. The model is composed of several modules for representing discourse elements and their relationships, different types of agents, the roles, provenance and versioning of a given statement, and bibliographic references. SWAN was designed to describe statements in papers (along with the evidence supporting them). If we consider a hypothesis as a text statement, the following example illustrates the SWAN model:

    @prefix swande: .
    @prefix swanco: .
    @prefix swanqs: .
    @prefix swandr: .
    @prefix swanpav: .
    @prefix swanci: .

    ex:hypothesis a swande:ResearchStatement ;
        swande:title "EGFR is associated with colon cancer subtype A"@en ;
        swanco:researchStatementQualifiedAs ;
        swanci:derivedFrom ex:execution1 ;
        ex:hasConfidenceReport ex:c1 ;
        swanpav:authoredBy ex:experimentScientist ;
        swanpav:createdOn "2012-02-03T14:38:00Z"^^xsd:dateTime .

In the example, a hypothesis is extracted from a research article. The hypothesis is represented as a statement, which can be further described with SWAN. The provenance of the hypothesis is captured as well, by representing the agents who created the hypothesis statement.

Finally, micropublications⁹ [Clark et al 2014] are derived from the SWAN model and can be considered a refinement of the nanopublication model. Micropublications propose a semantic model of scientific argumentation and evidence that supports natural language statements, data and materials specifications, discussion, etc. Figure 3 shows an illustrative example, where a micropublication uses a mechanism similar to an assertion graph to represent the claim of a protein being associated with a subtype of colon cancer, along with its supporting evidence. The micropublication model uses the Web Annotation Ontology¹⁰ to associate a micropublication and its contents with text from articles.

Figure 3: The example from Figure 1 adapted to the micropublication model, following [Clark et al 2014]. The namespaces indicate the ontology used: mp for micropublications, prov for the PROV ontology, and ext for the extension that would need to be added.

⁵ http://www.nanopub.org/nschema#
⁶ https://www.w3.org/TR/rdf11-concepts/
⁷ http://semanticscience.org/ontology/sio.owl
⁸ https://www.w3.org/TR/hcls-swan/
⁹ http://purl.org/mp
¹⁰ https://www.w3.org/ns/oa

4 DISCUSSION

Table 1 summarizes the different candidate models for hypothesis representation in automated discovery systems, according to the features described in Section 3.1. Most models lack support for qualifying a given hypothesis with confidence levels. In order to overcome this issue, we may follow an approach similar to Figure 1: extend the target model with a class (ConfidenceReport) and two properties (hasConfidenceReport and hasConfidenceLevel) linking them together. A reason why the confidence level may not be directly linked to a hypothesis is that the same hypothesis may be evaluated at different points in time, resulting in multiple confidence levels with different provenance information, each included in a separate confidence report.

The upper half of Table 1 corresponds to the models for coarse-grained hypothesis representation. These models include a main concept to refer to a hypothesis, but lack the means to describe hypothesis statements. Therefore, they do not meet the majority of the requirements that DISK has for representing hypothesis statements, qualifiers, history and evidence. However, the LinkedScience, OBI and EXPO vocabularies define different types of hypotheses, and may be potential candidates for reuse if we need to define a hypothesis taxonomy.

The lower half of Table 1 corresponds to fine-grained models to describe hypotheses, either defining classes and properties to qualify hypothesis statements with provenance metadata or relating their different parts together. Among these, the nanopublication and micropublication models are the most flexible approaches, compliant with most of the requirements of the DISK model (in the last row). LABORS uses a Datalog representation for describing hypothesis statements and is domain specific. The ovopublication model is a simplification of the nanopublication model to include provenance of assertions or collections of assertions. Although it could be used for hypothesis representation, we consider that the model would need to be thoroughly extended. Similarly, the SWAN model is extended in the micropublication approach to represent argumentation of facts in publications. Therefore, the nanopublication and micropublication models provide a richer initial framework.

A major difference between micropublications and nanopublications is the scope of the domain. For instance, micropublications were explicitly designed to model facts and argumentation of text statements. If an automated discovery system aims to represent single assertions of hypotheses and their evolution, then an argumentation framework such as the one proposed in the micropublication model is not necessary. In contrast, if the provenance trace includes all evidence to support a particular claim made in a hypothesis, then micropublications are an appropriate model to use.

Another aspect to consider is the support from the communities that are using these models. The nanopublication model has been discussed for some time, and has available tooling, documentation and examples.¹¹ The micropublication model has been documented in detail with examples [Clark et al 2014], but it has not yet reached the level of adoption and tooling that nanopublications have.

¹¹ http://nanopub.org/

Table 1: Overview of models for hypothesis representation.
Hypothesis model | Hypothesis statement | Hypothesis qualifier | Hypothesis evidence | Hypothesis history | Hypothesis classification | Use of standards
SWEET [Raskin and Pan 2005] | No | No | No | No | No | Yes (OWL)
OBI [Bandrowski et al 2016] | No | No | No | No | Yes | Yes (OWL)
EXPO [Soldatova and King 2006] | No | No | No | No | Yes | Yes (OWL)
OCRe [Sim et al 2014] | No | No | No | No | No | Yes (OWL)
Linked Science Vocabulary | No | No | Partly | No | No | Yes (OWL)
LABORS [Soldatova and Rzhetsky 2011] | No | No | Yes | No | Yes | Yes (OWL)
Nanopublications [Groth et al 2010] | Text/structured | No | Yes | Yes | No | Yes (OWL), named graphs
Ovopublications [Callahan and Dumontier 2013] | Text/structured | No | No | Yes | No | Yes (OWL), named graphs
SWAN [Ciccarese et al 2008] | Text | No | Yes | Yes | No | Yes (OWL)
Micropublications [Clark et al 2014] | Text | Yes | Yes | No | No | Yes (OWL), named graphs
DISK [Garijo et al 2017] | Structured | Yes | Yes | Yes | No | Yes (OWL), named graphs

Finally, both the nanopublication and micropublication models present an important limitation for representing hypotheses: they have been designed to describe simple facts, i.e., single statements or a single collection of statements as part of their claim. In the nanopublication model this is reflected by having a unique assertion graph per nanopublication, containing one or more statements. If we wanted to describe a hypothesis composed of multiple statements, each with confidence levels assigned independently by different experiments, we would have to extend the nanopublication model. A possibility may be creating a new class (a hypothesis composition concept such as the “hypotheses-set” in LABORS) that aggregates each of its statements as an individual nanopublication. Likewise, each micropublication contains a main claim graph and its support. A mechanism for extending and aggregating micropublications would also be needed to represent hypotheses with multiple statements. Note that the extension would only be necessary in both models if we wanted to keep the provenance for each statement of the hypothesis. Otherwise, the statements can be included in the assertion graph in the case of nanopublications or the claim graph in the case of micropublications.

5 CONCLUSIONS AND FUTURE WORK

In this paper we introduced the DISK hypothesis ontology for representing hypothesis evolution, which was developed for the DISK automated discovery system. We also presented a survey of existing vocabularies to represent hypotheses, and assessed their suitability in the context of automated knowledge discovery. Future work includes extending the DISK ontology to align with these models.

ACKNOWLEDGMENTS

We gratefully acknowledge support from the Defense Advanced Research Projects Agency through the SIMPLEX program with award W911NF-15-1-0555, and from the National Institutes of Health under award 1R01GM117097. We also thank our collaborators in the DISK project, especially Parag Mallick, Ravali Adusumilli, and Hunter Boyce for their useful feedback on this work.

REFERENCES

[Bandrowski et al 2016] Bandrowski A, Brinkman R, Brochhausen M, Brush MH, Bug B, et al. The Ontology for Biomedical Investigations. PLOS ONE 11(4): e0154556, 2016. https://doi.org/10.1371/journal.pone.0154556

[Callahan and Dumontier 2013] Alison Callahan and Michel Dumontier. Ovopub: Modular data publication with minimal provenance. arXiv preprint arXiv:1305.6800, 2013.

[Ciccarese et al 2008] Ciccarese P, Wu E, Kinoshita J, et al. The SWAN Scientific Discourse Ontology. Journal of Biomedical Informatics 41(5):739-751, 2008. doi:10.1016/j.jbi.2008.04.010.

[Clark et al 2014] Tim Clark, Paolo N. Ciccarese and Carole A. Goble. Micropublications: a semantic model for claims, evidence, arguments and annotations in biomedical communications. Journal of Biomedical Semantics 5:28, 2014.

[Garijo et al 2017] The DISK Hypothesis Ontology. Version 1.0.0. Available from http://disk-project.org/ontology/disk#

[Gil et al 2016] Gil, Y.; Garijo, D.; Ratnakar, V.; Mayani, R.; Adusumilli, R.; and Boyce, H. Automated Hypothesis Testing with Large Scientific Data Repositories. In Proceedings of the Fourth Annual Conference on Advances in Cognitive Systems (ACS), pages 1-6, 2016.

[Gil et al 2017] Gil, Y.; Garijo, D.; Ratnakar, V.; Mayani, R.; Adusumilli, R.; Boyce, H.; Srivastava, A.; and Mallick, P. Towards Continuous Scientific Data Analysis and Hypothesis Evolution. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.

[Groth et al 2010] Groth, Paul; Gibson, Andrew; Velterop, Jan. The anatomy of a nanopublication. Information Services and Use 30(1-2):52-56, 2010.

[King 2017] Ross King. The Adam and Eve Robot Scientists for the Automated Discovery of Scientific Knowledge. Bulletin of the American Physical Society, 2017.

[Lebo et al 2013] Lebo, T., McGuinness, D., Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S., and Zhao, J. The PROV ontology, W3C recommendation. Technical report, World Wide Web Consortium (W3C), 30th April 2013.

[Munafo et al 2017] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware and John P. A. Ioannidis. A manifesto for reproducible science. Nature Human Behaviour 1, Article number: 0021, 2017. doi:10.1038/s41562-016-0021

[Pankratius et al 2016] V. Pankratius, J. Li, M. Gowanlock, D. Blair, C. Rude, T. Herring, F. Lind, P. Erickson, C. Lonsdale. Computer-Aided Discovery: Towards Scientific Insight Generation with Machine Support. IEEE Intelligent Systems 31(4), pp. 3-10, Jul/Aug 2016.

[Raskin and Pan 2005] Robert G. Raskin and Michael J. Pan. Knowledge representation in the semantic web for Earth and environmental terminology (SWEET). Computers & Geosciences 31(9):1119-1125, November 2005. doi:10.1016/j.cageo.2004.12.004.

[Sim et al 2014] Sim I, Tu SW, Carini S, et al. The Ontology of Clinical Research (OCRe): An Informatics Foundation for the Science of Clinical Research. Journal of Biomedical Informatics 52:78-91, 2014. doi:10.1016/j.jbi.2013.11.002.

[Soldatova and King 2006] Soldatova, LN and King, RD. An Ontology of Scientific Experiments. Journal of the Royal Society Interface 3(11):795-803, 2006. doi:10.1098/rsif.2006.0134.

[Soldatova and Rzhetsky 2011] Soldatova, LN and Rzhetsky, A. Representation of research hypotheses. Journal of Biomedical Semantics 2(Suppl 2):S9, 2011. https://doi.org/10.1186/2041-1480-2-S2-S9
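
ILLUSTRATIVE SKETCH

As an illustration of the four-part hypothesis representation described in Section 2 (statement, qualifier, evidence, history), the following minimal Python sketch models a revised hypothesis that carries two independent confidence reports and links back to the hypothesis it revises. It is not the DISK implementation and does not use the terms of the DISK ontology: the class names (Evidence, Qualifier, Hypothesis), the field names, and the second confidence value (0.12) are assumptions made only for this example. The value 0.07 echoes the p-value example given in Section 2. Attaching the evidence to each qualifier is a simplification chosen here to mirror the case where two analyses cannot be combined.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class Evidence:
    """Record of an analysis that was run to test a statement (hypothetical class)."""
    description: str


@dataclass
class Qualifier:
    """A confidence report: one confidence level plus the evidence behind it."""
    confidence_level: float
    evidence: Evidence


@dataclass
class Hypothesis:
    """Statement, zero or more qualifiers, and a link to the revised hypothesis."""
    statement: Triple
    qualifiers: List[Qualifier] = field(default_factory=list)
    revision_of: Optional["Hypothesis"] = None  # the hypothesis history link

    def history(self) -> List[Triple]:
        """Return statements from this hypothesis back to the original one."""
        chain: List[Triple] = []
        current: Optional[Hypothesis] = self
        while current is not None:
            chain.append(current.statement)
            current = current.revision_of
        return chain


# Original hypothesis provided by the user (no analysis run yet).
hg1 = Hypothesis(("EGFR", "associatedWith", "ColonCancer"))

# Revised hypothesis with two separate confidence reports; 0.12 is made up.
hg2 = Hypothesis(
    ("EGFR", "associatedWith", "ColonCancerSubtypeA"),
    qualifiers=[
        Qualifier(0.07, Evidence("mass spectrometry analysis")),
        Qualifier(0.12, Evidence("analysis of a second data type")),
    ],
    revision_of=hg1,
)

for statement in hg2.history():
    print(statement)
```

Keeping qualifiers as a list, rather than a single confidence field on the hypothesis, reflects the point made in Section 4 that the same hypothesis may be evaluated several times with different provenance for each confidence level.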