
Proceedings of the 4th Workshop on Linked Science (LISC2014): Making Sense Out of Data

Collocated with the 13th International Semantic Web Conference (ISWC2014)

Riva del Garda, Trentino, Italy, 2014

Editors: Jun Zhao, Marieke van Erp, Carsten Keßler, Tomi Kauppinen, Jacco van Ossenbruggen, Willem Robert van Hage

Preface

Traditionally, scientific dissemination has relied heavily on publications and presentations. The findings reported in these articles are often backed by large amounts of diverse data produced by complex experiments, computer simulations, and observations of physical phenomena. Although publications, methods, and datasets are often related, this avalanche of data makes it extremely hard to correlate, reuse, and leverage scientific data. Semantic Web technologies provide a promising means for publishing, sharing, and interlinking data to facilitate data reuse and the necessary correlation, integration, and synthesis of data across levels of theory, techniques, and disciplines. However, even when these data become discoverable and accessible, significant challenges remain in making sense of them and in achieving the scientific discoveries that we anticipated.

The past three editions of the workshop (LISC2011, LISC2012, and LISC2013) have seen many novel ideas for using Semantic Web technologies to integrate scientific data (for example, from real experiments or from simulations) or to enable reproducibility of research via online tools and Linked Data. The theme for LISC2014 is “Making Sense out of Data Through Linked Science”. Here we focus on new ways of discovering interesting patterns in scientific data, which could lead to research validation, identification of new hypotheses, and acceleration of the scientific research cycle. We target new results obtained either through semantic reasoning or through innovative combinations of existing technologies (such as visualization, data mining, machine learning, and natural language processing) with Semantic Web technologies to enable a better understanding of data. One goal is to create both an incentive for scientists to consider the Linked Science approach for their scientific data management and an incentive for technologists from different disciplines to work together towards the vision of powering science with technologies.

LISC2014 was hosted at the 13th International Semantic Web Conference (ISWC2014), in Riva del Garda, Trentino, Italy. Twenty-seven attendees enjoyed the opening keynote “Making more sense out of social data” by Harith Alani (KMI, the Open University, UK), followed by excellent presentations of the eight regular papers collected in these proceedings. We continued the tradition of a “working” workshop with a plenary discussion on the challenges and opportunities of using Semantic Web technologies for sense making. The results of this discussion are published on FigShare and can be cited as:

Overall, this edition continued to provide a successful forum for discussing how Semantic Web technologies and Linked Data can help science. We thank the entire program committee for helping to assemble the program and the attendees for their enthusiastic participation.

The LISC 2014 Co-organizers:

Jun Zhao, Marieke van Erp, Carsten Keßler, Tomi Kauppinen, Jacco van Ossenbruggen, Willem Robert van Hage

Program Committee

Boyan Brodaric, Arne Broering, Paolo Ciccarese, Oscar Corcho, Aba-Sah Dadzie, Stefan Dietze, Mathieu d'Aquin, Daniel Garijo, Alasdair Gray, Paul Groth, Rinke Hoekstra, Krzysztof Janowicz, Simon Jupp, Tomi Kauppinen, Carsten Keßler, James Malone, Edgard Marx (additional reviewer), Jeff Pan, Heiko Paulheim, Marieke van Erp, Willem van Hage, Jacco van Ossenbruggen, Amrapali Zaveri, Jun Zhao

EPUB3 for Integrated and Customizable Representation of a Scientific Publication and its Associated Resources

Hajar Ghaem Sigarchian, Ben De Meester, Tom De Nies, Ruben Verborgh, Wesley De Neve, Erik Mannens, Rik Van de Walle

Semantic Lenses to Bring Digital and Semantic Publishing Together

Angelo Di Iorio, Silvio Peroni, Fabio Vitali, Jacopo Zingoni

Clustering Citation Distributions for Semantic Categorization and Citation Prediction

Francesco Osborne, Silvio Peroni, Enrico Motta

SMART Protocols: SeMAntic RepresenTation for Experimental Protocols

Olga Giraldo, Alexander Garcia, Oscar Corcho

LinkedPPI: Enabling Intuitive, Integrative Protein-Protein Interaction Discovery

Laleh Kazemzadeh, Maulik R. Kamdar, Oya D. Beyan, Stefan Decker, Frank Barry

Using the Micropublications Ontology and the Open Annotation Data Model to Represent Evidence within a Drug-Drug Interaction Knowledge Base

Jodi Schneider, Paolo Ciccarese, Tim Clark, Richard D. Boyce

Capturing Provenance for a Linkset of Convenience

Simon Jupp, James Malone, Alasdair J. G. Gray

Connecting Science Data Using Semantics and Information Extraction

Evan W. Patton, Deborah L. McGuinness

EPUB3 for Integrated and Customizable Representation of a Scientific Publication and its Associated Resources

Hajar Ghaem Sigarchian, Ben De Meester, Tom De Nies, Ruben Verborgh, Wesley De Neve, Erik Mannens, and Rik Van de Walle

Abstract. Scientific publications point to many associated resources, including videos, prototypes, slides, and datasets. However, discovering and accessing these resources is not always straightforward: links could be broken, readers may be offline, or the number of associated resources might make it difficult to keep track of the viewing order. In this paper, we explore potential integration of such resources into the digital version of a scientific publication. Specifically, we evaluate the most common scientific publication formats (PDF, HTML, EPUB2, and EPUB3) in terms of their capability to implement the desirable attributes of an enhanced publication and to meet the functional goals of an enhanced publication information system. In addition, we present an EPUB3 version of an exemplary publication in the field of computer science, integrating and interlinking an explanatory video and an interactive prototype. Finally, we introduce a demonstrator that is capable of outputting customized scientific publications in EPUB3. By making use of EPUB3 to create an integrated and customizable representation of a scientific publication and its associated resources, we believe that we are able to augment the reading experience of scholarly publications, and thus the effectiveness of scientific communication.

1 Introduction

Scientific publications consist of more than only text: they may also point to many associated (binary) resources, including videos, prototypes, slides, and datasets. Yet today, only the access to the text of a scientific publication is straightforward; the associated resources are often more difficult to access. For instance, readers may not always have an Internet connection at their disposal to download related materials, and even when this is the case, links might become broken after a while. Furthermore, given their diverse nature, related materials often need to be accessed in a different reading environment like a standalone media player, causing readers to lose track of the scientific narrative.

The 2007 Brussels Declaration³ by the International Association of Scientific, Technical and Medical (STM) Publishers states that “raw research data should be made freely available” and that “one size fits all solutions will not work”. In this paper, we illustrate that the ability to (adaptively) create an integrated representation of a scientific publication and its associated resources contributes to these goals. Specifically, we evaluate the most common scientific publication formats (PDF, HTML, EPUB2, and EPUB3) in terms of their capability to implement the desirable attributes of an enhanced publication and to meet the functional goals of an enhanced publication information system. In addition, we present an EPUB3 version of an exemplary publication in the field of computer science, integrating and interlinking an explanatory video and an interactive prototype. Finally, we introduce a demonstrator that is capable of outputting customized scientific publications in EPUB3.

The rest of this paper is structured as follows. In Section 2, we discuss a number of current best practices among three scientific publishers, focusing on the way open formats and their features are used to enhance scientific publications. Next, in Section 3, we investigate to what extent PDF, HTML, EPUB2, and EPUB3 facilitate the use of enhanced scientific publications and corresponding information systems. In Section 4, we present an exemplary scientific publication in EPUB3 that integrates an explanatory video and an interactive prototype. In Section 5, we introduce our demonstrator for creating customized scientific publications in EPUB3. Finally, in Section 6, we present our conclusions and a number of directions for future work.

2 Current Best Practices

In this section, we briefly discuss a number of current best practices among three scientific publishers, focusing on the way open formats are used to make available scientific publications that have been enhanced with multimedia, interactivity, and/or Semantic Web features.

BioMed Central and Hindawi Publishing Corporation: These publishers make scientific publications available in several formats: PDF, HTML, and EPUB2. The HTML version of the publications can for instance be enhanced with reusable data (e.g., supplementary datasets), while the EPUB2 version of the publications just uses links to cited publications in EPUB2 format. However, the publications in question do not contain any embedded interactive multimedia content.

Elsevier: Elsevier makes available different versions of a scientific publication: PDF, HTML, MOBI, and EPUB2. In addition, authors are able to deposit their datasets, making it possible for readers to access and download these datasets [1]. Moreover, the EPUB2 version of a publication is enriched with direct links to the PDF version of cited publications, thus not embedding these PDF versions into the EPUB2 file. Furthermore, the EPUB2 version of a publication does not contain any embedded interactive multimedia content.

³ http://www.stm-assoc.org/brussels-declaration/

In summary, we can conclude that none of the aforementioned EPUB2 versions (as currently made available by BioMed Central, Hindawi Publishing Corporation, and Elsevier) embed interactive multimedia content for offline usage (i.e., readers need to have network connectivity in order to be able to access all linked resources), nor do they contain Semantic Web features.

3 Comparative Analysis of Publication Formats

In recent years, a new open format for distribution and interchange of digital publications has emerged, called EPUB3 [6]. This format can also be used in the context of scientific publications. In what follows, we investigate to what extent PDF, HTML, EPUB2, and EPUB3 are able to support the properties of an enhanced scientific publication (that is, a scientific publication with multimedia, interactivity, and/or Semantic Web features). To that end, we analyzed a number of desirable attributes of an enhanced publication. Furthermore, we also investigated the functional goals of an enhanced publication information system (that is, the system that facilitates the authoring of enhanced publications).

Thoma et al. [10] defined a core set of nine desirable attributes of an enhanced publication: appearance, page transitions, in-page navigation, image browsing, navigation to an embedded/linked media object, support for interactivity, transmission, embedding and linking of multimedia/interactive objects, and document integrity and structure. In addition, by considering both the attributes defined by Thoma et al. in [10] and a review of five existing enhanced publications, Adriaansen et al. [2] identified eleven attributes of an enhanced publication: navigation by table of contents, metadata, links to figures and tables, attached data resources, link from text to references, direct publication links from references, reader comments, download as PDF, interactive content, relations, and cited by. Furthermore, as argued in a talk by Ivan Herman⁴, bridging online and offline access is a need for high-quality digital books, and consequently for high-quality digital scientific publications, given that offline access enables users to access supplementary information even when they do not have a network connection at their disposal. As a result, although none of the aforementioned research efforts discusses this aspect, we consider offline access to be a desirable attribute of an enhanced publication as well.

Besides the attributes of enhanced publications, we also considered data model and information system aspects. Bardi et al. [3] reviewed existing data models for enhanced publications, taking into account structural and semantic features, also proposing a classification scheme for enhanced publication information systems based on their main functional goals. In this context, the authors outline four major scientific motivations that explain the functional goals of an enhanced publication information system: packaging with supplementary material, improving readability and understanding, interlinking with research data, and enabling repetition of experiments. Furthermore, we believe that portability is also needed in order to preserve the availability of resources and their interlinking, given that it enables users to access supplementary information even in offline situations. Thus, an enhanced publication that has supplementary resources needs to be a self-contained package. Therefore, we identified portable packaged file as another desirable attribute of an enhanced publication.

⁴ http://www.w3.org/2014/Talks/0411-Seoul-IH/Talk.pdf

Finally, according to Liu [8], users are in need of a hybrid solution for print and digital resources. This means that, besides all different digital publication formats, print also remains an important publication medium. As a result, we see suitable for print as another desirable attribute of an enhanced publication.

Ideally, an enhanced publication information system should be able to support all the desirable attributes mentioned above. Considering the desirable attributes of enhanced publications and the functional goals of enhanced publication information systems, we mapped the attributes identified in [10,2] onto each functional goal identified by Bardi et al. in [3]. Our mapping can be found in the first and second column of Table 1. We can observe that nearly all desirable attributes of an enhanced publication can be covered by the functional goals of an enhanced publication information system, with the exception of the final three attributes, for which we defined our own functional goals.

Next, we investigated what scientific publication formats are the most promising to cover both the desirable attributes of an enhanced publication and the functional goals of an enhanced publication information system. We have summarized our findings in the four rightmost columns of Table 1. Corresponding explanatory notes can be found below.

Packaging with supplementary material: This functional goal states that it should be possible to add supplementary material to a scientific publication. PDF can embed audio and video but it does not support rich media (e.g., media overlays). As such, it is not a suitable format for embedding various types of associated resources (e.g., interactive content and standalone applications). Consequently, PDF has limited support for this functional goal and its underlying attributes. Note that extensions exist, such as export to a PDF Portfolio in Adobe Acrobat⁵, that make it possible to combine related materials. However, to the best of our knowledge, none of these extensions allows embedding, for instance, interactive content and standalone applications. Furthermore, the embedded resources are not reusable, unlike in EPUB3, which lets users reuse embedded resources. In order to package research data within an HTML file, all the dependencies need to be packaged as well. While this is possible (e.g., using a zipped folder), there is no standardized approach to do this, as opposed to EPUB2 and EPUB3. Therefore, we do not consider HTML to be suitable for meeting this functional goal. According to the EPUB2 specification [7], EPUB2 cannot embed multimedia and interactive objects. Consequently, EPUB2 also offers limited support for this functional goal. However, in EPUB3, no such restrictions are specified. As a result, we can conclude that EPUB3 is the only format that fully supports this functional goal.

⁵ http://www.adobe.com/products/acrobat/combine-pdf-files-portfolio.html

Table 1 (format columns: PDF, HTML, EPUB2, EPUB3; the per-attribute support marks are not recoverable)

Packaging with supplementary material
– Embedding and linking of multimedia/interactive objects
– Attached data resources
– Document integrity and structure
– Navigating to an embedded / linked media object

Enabling repetition of experiments
– Native support for interactivity
– Code execution
– Interactive content

Improving readability and understanding
– Navigation by table of contents
– Reader comments
– Appearance
– Page transitions
– In-page navigation
– Image browsing
– Links to figures and tables
– Direct publication links from references
– Cited by

Interlinking with research data
– Metadata
– Relations

Portable packaged file
– Bridging online / offline
– Transmission

Suitable for print
– Download as PDF

Enabling repetition of experiments: This functional goal aims at enabling researchers to (re-)execute experiments and/or demonstrators from within a scientific publication. PDF has limited support for scripting and code execution. However, the support available is not sufficient for building small standalone applications that can act as interactive content (e.g., self-contained widgets). As a result, PDF is not suitable for meeting this functional goal. HTML is able to embed code (e.g., JavaScript). Moreover, thanks to the inline frame element (that is, the iframe element), HTML can also be used as an interface to other experiments. As EPUB2 does not support JavaScript, it is not suited for repetition of experiments. However, similar to HTML, EPUB3 supports JavaScript, and thus the aforementioned functional goal (except for experiments that, for instance, run complex algorithms on clusters to obtain their results).

Improving readability and understanding: PDF is a specific format for print, and not for screen readers. While still undeniably the most suitable format for print layout, in digital form, it does not have device independence [5], making it difficult to maintain readability on different screens. According to the PDF specification, PDF has limited support for this functional goal. On the other hand, HTML, EPUB2, and EPUB3 are suitable for improving readability and understanding, because they can overcome the aforementioned shortcomings of PDF (cf. the use of reflowable layout).

Interlinking with research data: In order to make links between supplementary materials added to publications, (relational) metadata need to be taken into account. PDF has a coarse level of support for metadata (e.g., title and author information), and these metadata are not related to interlinking supplementary materials. As a result, PDF is not suitable for meeting this functional goal. HTML can be enriched for interlinking purposes using Semantic Web formats and technologies [9] (e.g., RDF and OWL). EPUB2 has limited support for metadata. Furthermore, it does not allow embedding multimedia and interactive content as supplementary research data. Hence, EPUB2 is not suitable for meeting this functional goal. According to the EPUB3 specification, EPUB3 supports metadata and interlinking of research data. In fact, it retains all functionality of (X)HTML5.

Apart from a suitable format, interlinking supplementary materials requires suitable ontologies. Fortunately, many suitable candidates for general and specific interlinking purposes are already available. For example, schema.org is an ontology that is suitable for use in a variety of domains, including the description of events and creative works. It can thus be used to semantically enhance publications, and it can also be extended by other ontologies. Furthermore, Standard Analytics⁶ aims at turning scholarly publications into an interface to a web of data, making use of already existing web ontologies. Moreover, Structural, Descriptive, and Referential (SDR)⁷ is an ontology for representing academic publications, related artifacts (e.g., videos, slides, and datasets), and referential metadata. This ontology can generically define all possible interactive and multimedia resources. In addition, any publication can use general ontologies such as the Citation Typing Ontology (CiTO)⁸, the Bibliographic Ontology (BIBO)⁹, and the Common European Research Information Format (CERIF)¹⁰. Finally, publications may also need to make use of ontologies that are specific for their research domains (e.g., in the medical domain, the Infectious Disease Ontology (IDO)¹¹ could be used).

⁶ https://standardanalytics.io/
⁷ http://onlinelibrary.wiley.com/doi/10.1002/asi.23007/full
⁸ http://www.essepuntato.it/lode/http://purl.org/spar/cito
⁹ http://bibliontology.com/
¹⁰ http://helios-eie.ekt.gr/EIE/bitstream/10442/13864/1/IJMSO_2014_CERIF_authorFinalVersion.pdf
¹¹ http://infectiousdiseaseontology.org/page/Main_Page

Portable packaged file: PDF has limited support for packaging interactive content and standalone applications. Furthermore, it cannot bridge the gap between online and offline usage. Indeed, PDF is an offline format for print, and any interactive parts will not remain after printing a publication. As mentioned before, HTML lacks a proper packaging structure, making this format not a suitable candidate for meeting this functional goal. A similar remark holds regarding EPUB2, as this format does not have support for embedding interactive multimedia resources. As EPUB3 has extensive support for embedding interactive multimedia resources, it can be considered a suitable format for creating portable packaged files. Ideally, users expect that all types of resources can be embedded in a packaged file, regardless of their size. This is one of the shortcomings of EPUB3. Embedding large datasets makes the size of an EPUB3 file potentially very large, causing portability and readability issues. We discuss a possible solution to this issue in Section 5.

Suitable for print: Currently, PDF is the only format suitable for print. Although HTML, EPUB2, and EPUB3 can also be used for the purpose of print, they have been designed for screen readers and can currently not match the high typesetting demands for print publications.
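The explanatory notes above can be condensed into a small lookup table. The following Python sketch encodes one reading of those notes at the level of functional goals; it is not a reproduction of the per-attribute marks in Table 1. The three-level scale ("yes", "limited", "no") and all identifiers are assumptions made for illustration.

```python
# Goal-level support per format, as read from the explanatory notes.
# The "yes" / "limited" / "no" scale is an illustrative assumption,
# not the notation used in Table 1.
FORMATS = ("PDF", "HTML", "EPUB2", "EPUB3")

SUPPORT = {
    "Packaging with supplementary material": ("limited", "no", "limited", "yes"),
    "Enabling repetition of experiments": ("no", "yes", "no", "yes"),
    "Improving readability and understanding": ("limited", "yes", "yes", "yes"),
    "Interlinking with research data": ("no", "yes", "no", "yes"),
    "Portable packaged file": ("limited", "no", "no", "yes"),
    "Suitable for print": ("yes", "no", "no", "no"),
}


def count_full_support(support=SUPPORT, formats=FORMATS):
    """Count, per format, the functional goals marked as fully supported."""
    counts = {fmt: 0 for fmt in formats}
    for levels in support.values():
        for fmt, level in zip(formats, levels):
            if level == "yes":
                counts[fmt] += 1
    return counts


if __name__ == "__main__":
    for fmt, n in count_full_support().items():
        print(f"{fmt}: {n} of {len(SUPPORT)} goals fully supported")
```

Under this reading, EPUB3 fully supports every goal except print output, which matches the summary that follows.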

As can be seen in Table 1, EPUB3 is the format that supports most desirable attributes of an enhanced publication and most functional goals of an enhanced publication information system. Only PDF is suitable for print output, given that HTML and EPUB(2/3) have been primarily designed for screen output, typically resulting in a layout that is suboptimal for print. Note that, as a workaround for this problem, the EPUB(2/3) and HTML versions of a publication can embed or link to the PDF version of a publication.

4 Proof-of-Concept: A Scientific Publication in EPUB3

In this section, we demonstrate how EPUB3 can be used to create an integrated representation of a scientific publication and its associated resources. To that end, we enhanced the “Everything is Connected” publication [11] (a paper authored by ourselves and a number of colleagues), embedding an explanatory video and an interactive prototype. The resulting proof-of-concept is available for download¹². We used Readium¹³ as our electronic reading system, since it supports most features of EPUB3. As illustrated by Figure 1, our proof-of-concept shows how a publication can act as an interface to different types of research outputs. Note that, instead of adding a link to the online version of the interactive prototype, we made use of an iframe to allow immediate access to the interactive prototype from within the publication, thus not requiring the reader to make use of a different reading environment.

¹² http://multimedialab.elis.ugent.be/users/hghaemsi/EnhancedPublication.epub
¹³ http://readium.org/
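To make the packaging step more concrete, the following Python sketch assembles a minimal EPUB3 container that embeds a video next to the text of a paper. It is an illustration under stated assumptions, not the tooling used for the proof-of-concept: the file names (paper.xhtml, media/explanation.mp4), the function build_epub, and the placeholder content are hypothetical, the interactive prototype is left out, and the result has not been validated with an EPUB checker.

```python
import io
import zipfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape

# Points the reading system at the package document.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="EPUB/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Package document: metadata, manifest of embedded resources, reading order.
PACKAGE_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">{identifier}</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="paper" href="paper.xhtml" media-type="application/xhtml+xml"/>
    <item id="video" href="media/explanation.mp4" media-type="video/mp4"/>
  </manifest>
  <spine>
    <itemref idref="paper"/>
  </spine>
</package>
"""

NAV_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <head><title>Contents</title></head>
  <body>
    <nav epub:type="toc"><ol><li><a href="paper.xhtml">Paper</a></li></ol></nav>
  </body>
</html>
"""

# The video is referenced by a relative path, so it plays offline.
PAPER_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>{title}</title></head>
  <body>
    <h1>{title}</h1>
    <video src="media/explanation.mp4" controls="controls"></video>
  </body>
</html>
"""


def build_epub(target, title, identifier, video_bytes):
    """Write a minimal EPUB3 container to `target` (a path or file object)."""
    modified = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    opf = PACKAGE_OPF.format(
        identifier=escape(identifier), title=escape(title), modified=modified
    )
    with zipfile.ZipFile(target, "w") as zf:
        # The container format expects `mimetype` as the first entry,
        # stored without compression.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        entries = {
            "META-INF/container.xml": CONTAINER_XML,
            "EPUB/package.opf": opf,
            "EPUB/nav.xhtml": NAV_XHTML,
            "EPUB/paper.xhtml": PAPER_XHTML.format(title=escape(title)),
            "EPUB/media/explanation.mp4": video_bytes,
        }
        for name, data in entries.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)


if __name__ == "__main__":
    buffer = io.BytesIO()
    build_epub(buffer, "Example paper", "urn:example:paper-1", b"placeholder")
    with zipfile.ZipFile(buffer) as zf:
        print(zf.namelist())
```

Because every resource travels inside the archive and is addressed by a relative path, the publication remains usable without network access, at the cost of a file size that grows with each embedded resource.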

Furthermore, we semantically enhanced our exemplary EPUB3 publication by making use of schema.org, a general ontology that allows describing books and articles, among other creative works. Thanks to properties such as embedUrl, description, and contentUrl, schema.org makes it possible to indicate how a resource is related to the target EPUB3 publication in a straightforward way. We illustrate this in Figure 2. Note that schema.org is supported by major search engines such as Bing, Google, Yahoo!, and Yandex. However, at the time of writing this paper, the aforementioned search engines did not have support yet for indexing EPUB3 publications (and reading the metadata available within these publications).
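As an illustration of how the properties named above can relate a resource to a publication, the following Python sketch builds a schema.org description and serializes it as JSON-LD. The choice of JSON-LD, the use of associatedMedia to attach the resources, the function describe_publication, and the example URL are assumptions made for this sketch; the serialization used in our publication may differ.

```python
import json


def describe_publication(title, resources):
    """Return a schema.org description of a publication as a JSON-LD dict.

    Each resource is a dict with the keys `type`, `name`, `description`,
    and `content_url`, plus an optional `embed_url` (hypothetical layout).
    """
    media = []
    for res in resources:
        item = {
            "@type": res["type"],
            "name": res["name"],
            "description": res["description"],
            # contentUrl points at the actual bytes of the resource.
            "contentUrl": res["content_url"],
        }
        if res.get("embed_url"):
            # embedUrl points at a player or page that embeds the resource.
            item["embedUrl"] = res["embed_url"]
        media.append(item)
    return {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "name": title,
        "associatedMedia": media,
    }


if __name__ == "__main__":
    description = describe_publication(
        "Example paper",
        [
            {
                "type": "VideoObject",
                "name": "Explanatory video",
                "description": "Walk-through of the approach.",
                "content_url": "media/explanation.mp4",
            },
            {
                "type": "MediaObject",
                "name": "Interactive prototype",
                "description": "Prototype shown inside an iframe.",
                "content_url": "prototype/index.html",
                "embed_url": "https://example.org/prototype",
            },
        ],
    )
    print(json.dumps(description, indent=2))
```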

5 Creating Customized EPUB3 Publications

In the previous sections, we explained how supplementary materials can be embedded into a scientific publication. As mentioned before, embedding all relevant supplementary materials in a portable packaged file is not always cost-effective and/or desirable for a user. Since the size of an EPUB3 file depends on the size of all embedded resources, it will not be lightweight in all use cases, e.g., when embedding large datasets. On the one hand, a packaged file should not face portability and other usage issues related to its size. On the other hand, the advantages of having a portable packaged publication are outweighed by the disadvantage of not being able to distribute the entire publication properly. Users may not need all embedded supplementary materials and may instead wish to have their own customized lightweight publication. For instance, big datasets or high-resolution images can be located in a remote repository and referred to, instead of being embedded in the portable packaged file. An environment for outputting customized publications allows users to select and embed the supplementary materials to the extent that they choose. Hence, they can determine the size of the EPUB3 file themselves. That way, the problem of distributing overly large publications is solved, and only the content that the user needs is distributed. The only disadvantage of this approach is the added complexity at the distribution side (i.e., at the platform of the publisher). However, most publishers already have an extensive online distribution infrastructure, which could easily be expanded with an interface such as the one we propose. For example, publishers such as Elsevier offer different formats of a publication to users. In particular, on the ScienceDirect website of Elsevier, users can select their preferred format.

To illustrate this concept of customizable publications, we implemented a basic demonstrator in which a user can first select the relevant supplementary material using a web interface, after which a customized EPUB3 publication is generated. Figure 3 shows the user interface of our online demonstrator. Content selection is entirely done at the client side, based on the HTML representation of a publication. The selected content is then packaged as an EPUB3 file on the server side. The resulting demonstrator is available online¹⁴. Note that the author of a publication can determine which elements are customizable, simply by adding the class customizable to the desired HTML elements.
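The filtering step behind such a customization can be sketched as follows. This Python example is not the demonstrator's code: it assumes that each customizable element carries an id attribute, that the user's selection arrives as a set of such ids, and that the publication is well-formed XHTML. The function name customize is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def customize(xhtml, selected_ids):
    """Drop every element marked `customizable` whose id was not selected.

    Elements without the class are always kept. Text that directly follows
    a removed element (its tail) is dropped together with it.
    """
    # Serialize XHTML elements without a namespace prefix.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml)
    # ElementTree has no parent pointers, so collect (parent, child) pairs
    # first and remove afterwards.
    to_remove = [
        (parent, child)
        for parent in root.iter()
        for child in list(parent)
        if "customizable" in child.get("class", "").split()
        and child.get("id") not in selected_ids
    ]
    for parent, child in to_remove:
        parent.remove(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        "<p>Main text of the paper.</p>"
        '<div id="video" class="customizable">Explanatory video</div>'
        '<div id="dataset" class="customizable figure">Large dataset</div>'
        "</body></html>"
    )
    print(customize(page, {"video"}))
```

The filtered document can then be handed to a packaging step, so that the size of the resulting EPUB3 file reflects only the resources the user selected.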

Ideally, the implemented functionality for outputting customized publications in EPUB3 would be integrated into an authoring environment, where authors and publishers could indicate which elements of a publication are customizable. In previous work, we have implemented such an authoring environment for the collaborative creation of enriched e-Books using EPUB3 [4]. It allows authors and publishers to create an electronic publication with all required material embedded. Next, this publication can be exported as an EPUB3 file. In future work, we aim to showcase an integrated version of this authoring environment with a customizable distribution platform as described above.

In this paper, we demonstrated that the increasingly popular EPUB3 format can be used to create integrated representations of a scientific publication and its associated resources. We believe that this contributes to a better reading experience and more effective scientific communication (e.g., support for the inclusion of explanatory videos and interactive prototypes should enable authors to better transfer their knowledge and experience). In addition, we indicated that an EPUB3 version of a scientific publication can be used as a primary version, from which other versions of the scientific publication can be reached (e.g., a PDF version for print), thereby allowing legacy content to persist.

We can identify a number of directions for future research. First, user-friendly authoring tools are needed that make it easy to create enhanced scientific publications, such that these publications can act as an interface to different research outputs. We have already started taking steps in this direction. Second, these authoring tools need to support different output formats, in order to meet the needs of both readers who read on paper and readers who read digitally. Third, these authoring tools also need to make it possible to easily add metadata to EPUB3 versions of scientific publications, such that EPUB3 versions of scientific papers may have the same degree of discoverability as PDF and HTML versions. Finally, it would be interesting to investigate the good practices of novel publication repositories such as PLOS ONE, Figshare, and ResearchGate.

The research activities described in this paper were funded by Ghent University, iMinds (a research institute founded by the Flemish Government), the Institute for Promotion of Innovation by Science and Technology in Flanders (IWT), the FWO-Flanders, and the European Union.

Semantic lenses to bring digital and semantic publishing together

Angelo Di Iorio, Silvio Peroni, Fabio Vitali, and Jacopo Zingoni

Abstract. Modern scholarly publishers are making steps towards semantic publishing, i.e. the use of Web and Semantic Web technologies to represent formally the meaning of a published document by specifying information about it as metadata and to publish them as Open Linked Data. In this paper we introduce a way to use a particular semantic publishing model, called semantic lenses, to semantically enhance a published journal article. In addition, we present the main features of TAL, a prototypical application that enables the navigation and understanding of a scholarly document through these semantic lenses, and we describe the outcomes of a user testing session that demonstrates the efficacy of TAL when addressing tasks requiring deeper understanding and fact-finding on the content of the document.

1 Introduction

In parallel with the evolution of the Web by means of Semantic Web technologies, modern publishers (and in particular scholarly publishers) are making steps towards enhancing digital publications with semantics, an approach that is known as semantic publishing [22]. In brief, semantic publishing is the use of Web and Semantic Web technologies to represent formally the meaning of a published document by specifying a large quantity of information about it as metadata and to publish them as Open Linked Data. As a confirmation of this trend, recently the Nature Publishing Group (publisher of Nature), the American Association for the Advancement of Science (publisher of Science) and the Oxford University Press have all announced initiatives to open their articles’ reference lists and to publish them as Open Linked Data³,⁴,⁵.

³ Nature.com Linked Data: http://data.nature.com.
⁴ http://opencitations.wordpress.com/2012/06/16/science-joins-nature-in-openingreference-citations
⁵ http://opencitations.wordpress.com/2012/06/22/oxford-university-press-tosupport-open-citations

However, the enhancement of a traditional scientific paper with semantic annotations is not a straightforward operation, since it involves much more than simply making semantically precise statements about named entities within the text. In [17], we have shown how several relevant points of view exist beyond the bare words of a scientific paper, such as the context of the publication, its structural components, its rhetorical structures (e.g. Introduction, Results, Discussion), or the network of citations that connects the publication to its wider context of scholarly works. These points of view are usually combined into an effective unit of scholarly communication, so well integrated into the paper as a whole and into the rhetorical flow of its natural language that the reader can scarcely discern them as separate entities. In the same work we also proposed separating these aspects into eight different sets of machine-readable semantic assertions (called semantic lenses), where each set describes one of the following (from the most contextual to the most document-specific): research context, authors’ contributions and roles, publication context, document structure, rhetorical organisation of discourse, citation network, argumentative characterisation of text, and textual semantics.

How can the theory of semantic lenses be used to effectively extend the semantic publishing capabilities of publishers? To answer this question, in this paper we introduce a prototypical HTML interface to scholarly papers called TAL (Through A Lens), which enables the navigation of a text document to which semantic lenses have been applied, making all the corresponding information explicit. This HTML interface is meant to be a proof of concept of the semantic lenses in a real-case scenario. We performed a user testing session that demonstrates the efficacy of TAL when addressing tasks requiring deeper understanding and fact-finding on the content of the document.

The rest of the paper is organised as follows. In Section 2 we introduce some significant works related to semantic publishing experiences and models. In Section 3 we show an application of semantic lenses to a particular scholarly article. In Section 4 we introduce TAL and describe its main features, while in Section 5 we discuss the outcomes of a user testing session we performed to assess the usability and effectiveness of TAL. Finally, in Section 6 we conclude the paper and sketch out some future work.

2 Related works

Much current literature concerns both proofs of concept for semantic publishing applications and models for the description of digital publishing from different perspectives. Because of this richness, here we present just some of the most significant works on these topics.

In [22], Shotton et al. describe their experience in enriching and providing appropriate Web interfaces for scholarly papers enhanced with provenance information, scientific data, bibliographic references, interactive maps and tables, with the intention of highlighting the advantages of semantic publishing to a broader audience. Along the same lines, in [19] Pettifer et al. discuss the pros and cons of the various formats for the publication of scholarly articles and propose an application for the semantic enhancement of PDF documents according to established ontologies.

A number of vocabularies for the description of research projects and related entities have been developed, e.g. the VIVO Ontology6 (developed for describing the social networks of academics, their research and teaching activities, their expertise, and their relationships to information resources), the Description Of A Project7 (an ontology with multi-lingual definitions that contains terms specific to software development projects) and the Research Object suite of ontologies [1] (for linking together scientific workflows, the provenance of their executions, interconnections between workflows and related resources such as datasets and publications, and social aspects related to such scientific experiments).

One of the most widely used ontologies for describing bibliographic entities and their aggregations is BIBO, the Bibliographic Ontology [3]. FRBR, Functional Requirements for Bibliographic Records [10], is another, more structured model for describing documents and their evolution in time. One of the most important aspects of FRBR is the fact that it is not tied to a particular metadata schema or implementation.

Several works have been proposed in the past to model the rhetoric and argumentation of papers. For instance, the SALT application [9] permits someone such as the author “to enrich the document with formal descriptions of claims, supports and rhetorical relation as part of their writing process”. There are other works, based on [23], that offer an application of Toulmin’s model within specific scholarly domains, for instance the legal and legislative domain [11]. A good review of the other Semantic Web models for the description of arguments can be found in [21].

3 The Semantic Lenses

In [17], we claimed that the semantics of a document is definable from different perspectives, where each perspective is represented as a semantic lens that is applied to a document to reveal a particular semantic facet. In this section we briefly summarise our theory. A full example of the lenses applied to the well-known paper “Ontologies are us: A unified model of social networks and semantics” [14] is available at http://www.essepuntato.it/lisc2014/lens-example.

Lenses are formalised in the LAO ontology8. In addition, since the application of the semantic lenses to a document is an authorial activity, i.e. the action of a person (the original author as well as anyone else) taking responsibility for a semantic interpretation of the document, we also record the provenance of the semantic statements according to the Provenance Ontology (PROV-O) [12].

Figure 1 summarises the overall conceptual framework. The lenses are organised in two groups: context-related, which describe the elements contributing to the creation and development of a paper, and content-related, which describe the content itself of the paper from different angles. Writing a scientific paper is usually the final stage of an often complex collaborative and multi-domain activity of undertaking the research investigation from which the paper arises. The organizations involved, the people affiliated to these organizations and their roles and contributions, the grants provided by funding agencies, the research projects funded by such grants, the social context in which a scientific paper is written, the venue within which a paper appears: all these provide the research context that leads, directly or indirectly, to the genesis of the paper, and awareness of these may have a strong impact on the credibility and authoritativeness of its scientific content.

6 VIVO Ontology: http://vivoweb.org/ontology/core
7 DOAP: http://usefulinc.com/ns/doap
8 Lens Application Ontology (LAO): http://www.essepuntato.it/2011/03/lens.

Three lenses are designed to cover these aspects:

– Research context: the background from which the paper emerged (the research described, the institutions involved, the sources of funding, etc.). To describe such contextual environment we use FRAPO, the Funding, Research Administration and Projects Ontology9.
– Contributions and roles: the individuals claiming authorship on the paper and what specific contributions each made. We use SCoRO (the Scholarly Contributions and Roles Ontology10) and its imported ontology PRO (the Publishing Roles Ontology11) [18] to describe these aspects.

9 FRAPO: http://purl.org/cerif/frapo
10 SCoRO: http://purl.org/spar/scoro
11 PRO: http://purl.org/spar/pro

LinkedPPI: Protein-Protein Interaction Discovery

rate-limiting step to our data-warehousing approach for centralised analysis. We have proposed a domain-specific model which can accommodate the needs in the field of PPI modelling. The use of a domain-specific model and an interactive graph-based exploration platform for search and aggregative visualisation makes our integration approach more intuitive for the actual users who deal with PPI predictions. We have also proposed a set of three user scenarios depicting how the LinkedPPI framework could be used for the prediction of potential interactions between proteins, domains and genomic regions.

6 Future Work

The approach presented in this work is used to extract valuable information about the PPI network, domain-domain interactions and selective genomic interactions. However, the observations obtained from such data retrieval are raw, and could become a valuable asset for simulations and prediction methods if further analysis is done. As part of our future work we intend to apply statistical analysis to the significance of such observations, in order to develop a classifier algorithm that is able to predict interacting and non-interacting protein pairs.

Acknowledgements

This work has been done under the Simulation Science program at the National University of Ireland, Galway. SimSci is funded by the Higher Education Authority under the program for Research in Third-level Institutions and co-funded under the European Regional Development Fund.

Using the Micropublications ontology and the Open Annotation Data Model to represent evidence within a drug-drug interaction knowledge base

Jodi Schneider1, Paolo Ciccarese2, Tim Clark2, and Richard D. Boyce3

1 INRIA Sophia Antipolis France
jodi.schneider@inria.fr
2 Massachusetts General Hospital and Harvard Medical School
paolo.ciccarese@gmail.com; tim clark@harvard.edu
3 University of Pittsburgh
rdb20@pitt.edu

Abstract. Semantic web technologies can support the rapid and transparent validation of scientific claims by interconnecting the assumptions and evidence used to support or challenge assertions. One important application domain is medication safety, where more efficient acquisition, representation, and synthesis of evidence about potential drug-drug interactions is needed. Potential drug-drug interactions (PDDIs), defined as two or more drugs for which an interaction is known to be possible, are a significant source of preventable drug-related harm. The combination of poor quality evidence on PDDIs, and a general lack of PDDI knowledge by prescribers, results in many thousands of preventable medication errors each year. While many sources of PDDI evidence exist to help improve prescriber knowledge, they are not concordant in their coverage, accuracy, and agreement. The goal of this project is to research and develop core components of a new model that supports more efficient acquisition, representation, and synthesis of evidence about potential drug-drug interactions. Two Semantic Web models, the Micropublications Ontology and the Open Annotation Data Model, have great potential to provide linkages from PDDI assertions to their supporting evidence: statements in source documents that mention data, materials, and methods. In this paper, we describe the context and goals of our work, propose competency questions for a dynamic PDDI evidence base, outline our new knowledge representation model for PDDIs, and discuss the challenges and potential of our approach.

1 Introduction

Scientific knowledge depends on the verification and integration of large systems of interconnected assertions, assumptions, and evidence. These systems are continually growing and changing, as new scientific studies are completed and new documents are published. The state of current knowledge in any given domain can be difficult for any one individual to fully grasp, because bits of knowledge are updated at frequent intervals.

In the biosciences, this problem has taken on particular importance, due to an exponential growth in the aggregate publication rate. Manually curated databases are used to record certain types of knowledge. To update and maintain these databases, curators must make knowledge-intensive decisions, identifying the best available evidence in the current scientific literature. Maintaining such databases is challenging because there is limited tracking of the source information.

In an ongoing project, we are experimenting with using the Micropublications Ontology4 [Clark2014] and the Open Annotation Data Model5 [W3C2013] to create an audit trail between assertions, evidence, and source documents, so that assertions and evidence can be flagged for update in flexible and intelligent ways. Updates may be needed when the underlying sources change, when a particular method for establishing an assertion is discredited, etc. Our goal is to provide better linkages between an assertion recorded in a knowledge base and its supporting evidence (i.e., data, materials, and methods) found in source documents.

In the remainder of the paper, we describe the competency questions for our evidence base and the new evidence model that we are creating, which combines the Micropublications Ontology and the Open Annotation Data Model, and adapts them to the existing evidence modeling of the Drug Interaction Knowledge Base6 [Boyce2007,Boyce2009]. We then reflect on how the new model performs for our goal of creating an audit trail between assertions, evidence, and source documents.

2 Context and goals

Our work is in the context of a larger project on organizing and synthesizing scientific evidence from the biomedical literature on potential drug-drug interactions. Potential drug-drug interactions (PDDIs), defined as two or more drugs for which an interaction is known to be possible, are a significant source of preventable drug-related harm (i.e., adverse drug events, or ADEs). The combination of poor quality evidence on PDDIs, and a general lack of PDDI knowledge by prescribers, results in many thousands of preventable medication errors each year. While many sources of PDDI evidence exist to help improve prescriber knowledge, they are not concordant in their coverage [Saverno2011], accuracy [Wang2010], and agreement [Abarca2003]. Difficulties with synthesizing evidence, and gaps in the scientific knowledge of PDDI clinical relevance, underlie such disagreement.

4 http://purl.org/mp/
5 http://www.openannotation.org/spec/core/
6 http://purl.net/net/drug-interaction-knowledge-base/

To address these problems, our research group is studying the potential benefit of applying recent developments from the Semantic Web community on scientific discourse modeling and open annotation. The goal is to develop core components of a new PDDI knowledge representation model that will support more efficient acquisition, representation, and synthesis of PDDI evidence. The desired knowledge representation will provide better linkages between PDDI assertions and their supporting evidence, by directly connecting to annotated section(s) of relevant source documents.

3 Approach

Our new approach will draw upon the current version (1.2) of the Drug Interaction Knowledge Base [Boyce2007,Boyce2009], the Open Annotation Data Model [W3C2013], and the Micropublications Ontology [Clark2014].

The Drug Interaction Knowledge Base (DIKB) is a static, manually constructed evidence base that indexes assertions and evidence of PDDI for over 60 drugs. Its taxonomy of assertion types and evidence types [Boyce2014] is a starting point for the new knowledge base. The current version of the DIKB implements a version of the SWAN semantic discourse ontology [Ciccarese2008] to represent evidence relations. Specifically, the knowledge base uses swanco:citesAsSupportingEvidence and swanco:citesAsRefutingEvidence to link to an entire source document as a supporting or refuting citation. At the time the DIKB 1.2 was constructed (2007–2009), annotation methodologies were less well developed. Consequently, version 1.2 of the DIKB stores quotes as textual strings manually copied from source documents. The text has been enriched with metadata about the source section, but it is non-trivial to return to the appropriate segment of the text from this information.

Our use of the Open Annotation Data Model (OA) reflects a change in the state of the art. OA is “an interoperable framework for creating associations between related resources, annotations, using a methodology that conforms to the Architecture of the World Wide Web”7. In particular, OA allows an evidence database to provide explicit connections from quotes to their source documents. For example, as shown in Figure 1, an OA resource can be used to quote a specific part of a drug product label (also known as a summary of product characteristics) to indicate evidence that escitalopram inhibits CYP2D6. In general, OA enables queryable links from selections in source documents (as target) to the instances of data, methods, and materials (as body) that we want to model to support drug interaction knowledge base use cases.
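As a rough illustration of the target/body pattern described above, the following Python sketch builds an annotation as a plain dictionary. The `urn:example:` identifiers, the helper name `make_annotation`, and the choice of a text-quote selector are illustrative assumptions; this is neither the DIKB implementation nor a normative OA serialization.

```python
import json


def make_annotation(annotation_id, source_uri, exact_quote, body_uri):
    """Build a minimal OA-style annotation linking a quote to a body resource."""
    return {
        "@id": annotation_id,
        "@type": "oa:Annotation",
        # The target is the quoted selection inside the source document.
        "oa:hasTarget": {
            "@type": "oa:SpecificResource",
            "oa:hasSource": source_uri,
            "oa:hasSelector": {
                "@type": "oa:TextQuoteSelector",
                "oa:exact": exact_quote,
            },
        },
        # The body is the evidence item (data, method, or material) described.
        "oa:hasBody": body_uri,
    }


# All identifiers below are hypothetical placeholders.
annotation = make_annotation(
    "urn:example:annotation-1",
    "urn:example:product-label",
    "In vitro studies did not reveal an inhibitory effect of escitalopram on CYP2D6.",
    "urn:example:evidence-item-1",
)
print(json.dumps(annotation, indent=2))
```

Keeping the quote inside a selector on the target, rather than as a free-standing string, is what makes it possible to navigate from an evidence item back to the exact passage in its source.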

Similarly, the Micropublications Ontology improves the depth with which evidence can be represented and queried. The most important feature of the Micropublications model, in our view, is its ability to represent the data, methods, and materials that act as support for a claim, and to transitively close chains of claims8 and citations across the literature to their fundamental supporting evidence. A mp:Micropublication mp:argues a mp:Claim based on connecting any number of mp:Representations. The whole Micropublication is a Representation, as are Data and Methods (including Materials and Procedures), whether textual or pictorial. A mp:Representation may mp:support or mp:challenge any other mp:Representation, making the evidence explicit and queryable.

4

To design an appropriate enhancement of the DIKB model with Micropublications and the Annotation Ontology, we need to understand what sorts of questions experts would like to answer about the PDDIs. The competency questions below were elicited from experienced editors of clinically oriented drug compendia during the process of developing DIKB 1.2. Most fall into three categories: finding assertions and evidence; assessing the evidence; and enabling updates. A further area of interest is statistical information about the evidence base, which is useful for various analytics related to knowledge base maintenance.

4.1 Finding assertions and evidence

1. Understanding evidence coming from a given study:
   (a) What data, methods, materials, are reported in evidence item X?
   (b) Which evidence items are related to and follow-up on evidence item X?
   (c) Which research group conducted the study used for evidence item X?
   (d) Are the evidence use assumptions for evidence item X concordant? unique? non-ambiguous?
2. Verifying plausibility of an evidence item:
   (a) Has evidence item X been rejected for assertion Y? If so, why and by whom?
   (b) Which other assertions are being supported/challenged by this evidence item?
   (c) What are the assumptions required for use of this evidence item to support/refute assertion X?
3. Checking assertions about pharmacokinetic parameters (e.g., area under the concentration time curve (AUC)):
   (a) How many pharmacokinetic studies used for evidence items in the DIKB could be used to support or refute an assertion about pharmacokinetic parameter X (e.g., ‘X increases AUC’)?
   (b) How many pharmacokinetic studies in the DIKB used for evidence items for assertion X are based on data from the product label?
   (c) What is the result of averaging (or applying some other statistical operation) to the values for pharmacokinetic parameter X across all relevant studies used for evidence items?
4. Checking for differences in the product labeling:
   (a) Are there differences in the evidence items that were identified across different versions of product labeling for the same drug?
   (b) What version of product labeling was used for evidence item X? Original manufacturer or repackager? Most current label or outdated? Is the drug on market in country X or not? American or country X?

8 ‘Assertion’ in DIKB terminology corresponds to a ‘Claim’ in the Micropublications model; this variation in terms is because the term ‘claim’ is used in a different sense in medical billing.

Supporting updates to evidence and assertions

1. Changing status of redundant and refuted evidence:
   (a) Remove an older version of a redundant evidence item
   (b) Change the modality of a supporting evidence item to be a refuting evidence item
2. Updating when key sources change:
   (a) Get all assertions that are supported by evidence items identified from an FDA guidance or other source document just released as an updated version.

4.4 Understanding the evidence base

1. Statistical information about the evidence base:
   (a) Number of assertions in the system
   (b) Number of evidence items for and against each assertion type
   (c) Show the distribution of the levels of evidence for various assertion types (e.g., pharmacokinetic assertions)

5 Modeling evidence about drug-drug interactions

would represent some of the evidence supporting and challenging the assertion escitalopram does not inhibit CYP2D6. We created the example by hand using a sample assertion and evidence items from the DIKB version 1.2.

Fig. 2. A model of the evidence for and against the assertion escitalopram does not inhibit CYP2D6. This is based on the Micropublications ontology, and reuses the evidence taxonomy (dikbEvidence), terms (dikb), and data from the DIKB. The Drug Ontology (DRON) and Protein Ontology (PRO) are reused in semantic qualifiers. A more detailed view of Method Me1 is shown in Figure 1.

The Micropublications ontology is used to structure the evidence relating to data, methods, and materials, and the overall indication that evidence mp:supports or mp:challenges a mp:Claim. We qualify Claims (C1 in the figure) by reusing identifiers from DRON10 [Hanna2013] and the Protein Ontology11 [Natale2011]. The new model reuses the DIKB evidence taxonomy12 to provide epistemic qualification (SQ2, SQ5, SQ6 in the figure) to statements (S1, S2, and S3 in the figure), data (D1 in the figure), methods (Me1 in the figure), and materials (not shown in this example). The Open Annotation Data Model (previously shown in Figure 1) is used to link quotes taken from source documents back to their originating information artifacts. The approach to modeling other DIKB assertions would be similar to this example.

6

Certain benefits accrue from upgrading from the current DIKB. Many of the competency questions (Section 4) are not supported in the DIKB 1.2. The new model is designed to support these and additional questions relevant in the domain. Visual inspection of the model suggests that we will be able to answer some competency questions quite naturally. In particular, we expect to be able to find the assertions that are not supported by evidence already in the evidence base, the evidence that should be checked most thoroughly (e.g. evidence that by itself supports multiple assertions), and the data, methods, and materials associated with a given evidence item as described in source documents.
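The kinds of questions listed above can be sketched over a small support/challenge graph in Python. This is a minimal sketch under stated assumptions: the node names borrow labels used in this paper (C1, S1, D1, Me1), but the edges, the extra claims C2 and C3, and all function names are invented for illustration and do not reproduce the actual model or the DIKB data.

```python
from collections import defaultdict

# Hypothetical edges of the form (source, relation, target).
EDGES = [
    ("S1", "supports", "C1"),
    ("S2", "supports", "C1"),
    ("S3", "challenges", "C1"),
    ("D1", "supports", "S3"),
    ("Me1", "supports", "D1"),
    ("S1", "supports", "C2"),
]
CLAIMS = {"C1", "C2", "C3"}  # C2 and C3 are invented for this sketch


def unsupported_claims(edges, claims):
    """Claims with no incoming 'supports' edge."""
    supported = {tgt for _, rel, tgt in edges if rel == "supports"}
    return sorted(claims - supported)


def multi_claim_evidence(edges, claims):
    """Items that directly support more than one claim."""
    targets = defaultdict(set)
    for src, rel, tgt in edges:
        if rel == "supports" and tgt in claims:
            targets[src].add(tgt)
    return sorted(src for src, tgts in targets.items() if len(tgts) > 1)


def supporting_chain(edges, node):
    """Everything that transitively supports `node`."""
    incoming = defaultdict(set)
    for src, rel, tgt in edges:
        if rel == "supports":
            incoming[tgt].add(src)
    seen, stack = set(), [node]
    while stack:
        for src in incoming[stack.pop()]:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return sorted(seen)


print(unsupported_claims(EDGES, CLAIMS))    # claims lacking support
print(multi_claim_evidence(EDGES, CLAIMS))  # evidence worth checking first
print(supporting_chain(EDGES, "S3"))        # data and methods behind S3
```

In a Linked Data setting the same questions would more likely be expressed as queries over the published graph; the plain-Python version is only meant to show that each one reduces to a simple traversal of support relations.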

Further, as a Linked Data resource, our new knowledge base will also enable innovative queries using knowledge from other sources about tagged entities (i.e., drugs and proteins) represented in the evidence base. Unlike the current DIKB, we will be able to render annotations in their original context. We also expect to be able to support distributed community annotation/curation, since MP and OA take account of provenance, and since OA is being increasingly adopted by a variety of annotation tools.

Our project does raise certain modeling challenges. To date, MP has not been used to represent both unstructured claims and the related logical sentences. Figure 1 shows the assertion escitalopram does not inhibit CYP2D6 as unstructured text. However, the DIKB requires 1) that assertions about PDDIs be formulated by experts prior to collecting evidence, and 2) that the assertions be represented both as unstructured statements and as sentences in a logical formalism. Careful thought is being put into how to properly accommodate this use case. Such challenges are to be expected since MP is a relatively new ontology and since this is a new application of it.

Another challenge is to ensure that, as the evidence base scales, competency questions can be answered efficiently. To address this, we are building the model using an iterative design-and-test approach. In this process, efficient querying is a key requirement.

6.3 Other issues

For enabling synthesis over the PDDI information, the model is not the only concern. Applying this model will require integration work. One challenge is inherent to scholarly documents: the existing evidence items within the DIKB refer to many data, materials, and methods that exist only in PDF documents accessible only through proprietary portals or academic library systems. Consequently, resolving annotations requires a method for pointing to proprietary oa:targets.

7 Conclusions & Future Work

of the art from scientific documents. The knowledge representations we are now creating will be beneficial for integrating PDDI evidence, and we hope they will inspire an increased use of linked data for evidence synthesis in other domains.

Acknowledgments

This work was carried out during the tenure of an ERCIM “Alain Bensoussan” Fellowship Programme. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 246016, and a grant from the National Library of Medicine (1R01LM011838-01). We thank Carol Collins, Lisa Hines, and John R Horn for serving on the Evidence Panel of “Addressing PDDI Evidence Gaps”, and for contributing to the competency questions presented here.

[Abarca2003] Abarca, Jacob, Daniel C. Malone, Edward P. Armstrong, Amy J. Grizzle, Philip D. Hansten, Robin C. Van Bergen, and Richard B. Lipton. “Concordance of severity ratings provided in four drug interaction compendia.” Journal of the American Pharmacists Association 44;2 (2003): 136–141.
[Boyce2014] Boyce, R.D. “A Draft Evidence Taxonomy and Inclusion Criteria for the Drug Interaction Knowledge Base.” August 9, 2014, url: http://purl.net/net/drug-interaction-knowledge-base/evidence-types-and-inclusion-criteria
[Boyce2007] Boyce, Richard D., Carol Collins, John Horn, and Ira Kalet. “Modeling Drug Mechanism Knowledge Using Evidence and Truth Maintenance.” IEEE Transactions on Information Technology in Biomedicine 11;4 (2007): 386–397.
[Boyce2009] Boyce, Richard D., Carol Collins, John Horn, and Ira Kalet. “Computing with evidence: Part I: A drug-mechanism evidence taxonomy oriented toward confidence assignment.” Journal of Biomedical Informatics 42;6 (2009): 979–989.
[Ciccarese2008] Ciccarese, Paolo N., Elizabeth Wu, Gwen Wong, Marco Ocana, June Kinoshita, Alan Ruttenberg, and Tim Clark. “The SWAN biomedical discourse ontology.” Journal of Biomedical Informatics 41;5 (2008): 739–751.
[Ciccarese2014] Ciccarese, Paolo N., Marco Ocana, and Tim Clark. “Open semantic annotation of scientific publications using DOMEO.” Journal of Biomedical Semantics Apr 24;3 (2012): Suppl 1:S1.
[Clark2014] Clark, Tim, Paolo N. Ciccarese, and Carole A. Goble. “Micropublications: a semantic model for claims, evidence, arguments and annotations in biomedical communications.” Journal of Biomedical Semantics 5;28 (2014).
[Hanna2013] Hanna, Josh, Eric Joseph, Mathias Brochhausen, and William R. Hogan. “Building a drug ontology based on RxNorm and other sources.” Journal of Biomedical Semantics 4 (2013): 44–52.
[Natale2011] Natale, Darren A., Cecilia N. Arighi, Winona C. Barker, Judith A. Blake, Carol J. Bult, Michael Caudy, Harold J. Drabkin, Peter D’Eustachio, Alexei V. Evsikov, Hongzhan Huang, Jules Nchoutmboube, Natalia V. Roberts, Barry Smith, Jian Zhang and Cathy H. Wu. “The Protein Ontology: a structured representation of protein forms and complexes.” Nucleic acids research 39, no. suppl 1 (2011): D539–D545.
[Saverno2011] Saverno, Kim R., Lisa E. Hines, Terri L. Warholak, Amy J. Grizzle, Lauren Babits, Courtney Clark, Ann M. Taylor, and Daniel C. Malone. “Ability of pharmacy clinical decision-support software to alert users about clinically important drug-drug interactions.” Journal of the American Medical Informatics Association 18;1 (2011): 32–37.
[Wang2010] Wang, Lorraine M., Maple Wong, James M. Lightwood, and Christine M. Cheng. “Black box warning contraindicated comedications: concordance among three major drug interaction screening programs.” Annals of Pharmacotherapy 44;1 (2010): 28–34.
[W3C2013] Sanderson, Rob, Paolo N. Ciccarese, and Herbert Van de Sompel (editors). “Open Annotation Data Model”, W3C Community Group Draft, 08 February 2013, url: http://www.openannotation.org/spec/core/

Capturing Provenance for a Linkset of Convenience

Simon Jupp1, James Malone1, and Alasdair J G Gray2

1 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, United Kingdom
2 Department of Computer Science, Heriot-Watt University, Edinburgh, United Kingdom

Abstract. Biological interactions such as those between genes and proteins are complex and require intricate OWL models. However, direct links between biological entities can support search and data integration. In this paper we introduce linksets of convenience that capture these direct links. We show the provenance statements required to track the derivation of such linksets, linking them back to the full biological justification.

Keywords: Data linking, Provenance, VoID

1

Investigating biological systems, such as those implicated in disease, necessitates the connection of many levels of biology: gene, gene variation, gene expression, protein structure, signalling pathways, phenotypic and epidemiological data, and so on. The ability to integrate data across these levels relies on links that can be formed between biological entities, for example, going from a gene to proteins or from proteins to pathways. For each of these links there is some biological justification that may involve several steps (see Section 2 for details). To support tasks such as search and data integration it is convenient to provide additional shortcuts in the form of a direct link, e.g. genes to pathways.

Modeling the true nature of the links using semantic web technologies such as OWL removes ambiguity when working with data by giving it a well defined and precise semantics. However, it increases the complexity of interacting with the data, as the OWL model needs to capture the full intricacies of the biological interactions. As we move to publish biological data as linked open data, there is an opportunity to describe direct links between different types of biological entities as shortcuts between entities that feature in common queries, such as gene to protein. Such links capture the way that biologists often discuss the domain and enable novel integrations of the data. These direct links provide a working notion that cuts through the biology but which does not necessitate capturing (or recapturing) the complex multivariate relationships that can hold between the two entities. Such linksets are already used to support the Open PHACTS Discovery Platform [1], although those linksets do not have adequate provenance.

[Figure 1 labels: Ensembl Gene, Ensembl Transcript, Ensembl Exon, Ensembl Protein, uniprot:Protein; so:gene, so:transcript, so:exon, so:polypeptide; so:has_part, so:transcribed_from, so:translates_to, :ep2upRelation]

In this paper we propose a mechanism to model these links of convenience using a combination of VoID linksets [2] and PROV [3]. We avoid misrepresenting links by applying semantically weaker relationships together with additional provenance which represents the underlying complexity. We illustrate the model with an example using data from two popular biological databases.

2 Linking genes to proteins use case

We motivate our work with an example mapping between Ensembl [4] (a database of genome annotation) and UniProt [5] (a database of protein sequences). These databases already contain cross-references between an Ensembl Gene (EG) and a UniProt Protein (UP). However, to understand how this mapping is generated you currently need to find the correct publications and online documentation; the derivation is not directly discoverable from the data.

Biological theory tells us that a gene encodes a protein, although this biological relation only truly holds for the link between the EG and the Ensembl Protein (EP) entity. There are in fact multiple types of UP to EP mappings; for instance, they can be derived from an exact sequence identity or they might be based on a percentage sequence identity. Figure 1 illustrates how we model EG to EP using terminology defined in the Sequence Ontology, and for illustration we include a superproperty of all the EP to UP mappings that we call ep2upRelation (UniProt are currently extending their vocabulary to define these relations). We introduce a link of convenience (dashed line) from the EG to the UP that supports queries using the semantically weak skos:related relation. This schema lacks the provenance to assert that the skos:related link of convenience is derived from the longer chain of semantically richer links that hold from a gene to a protein.

The model outlined in Figure 1 can be decorated with provenance that captures additional information about how the link of convenience between EG and UP is derived. The resulting linkset description is shown in Figure 2. In the following we describe the blocks of RDF.

Figure 2 (linkset description):

 1  # define the ensembl protein partition
 2  :ensembl void:classPartition :EPpartition .
 3  :EPpartition void:class so:Polypeptide .
 4
 5  # define the Uniprot protein partition
 6  :uniprot void:classPartition :UPpartition .
 7  :UPpartition void:class uniprot:Protein .
 8
 9  # define the linkset that links the two partitions
10  :ensemblProteinToUniprotProteinLinkset a void:Linkset ;
11      void:linkPredicate :ep2upRelation .
12
13  # define partitions for ensembl gene, gene transcript and
14  # transcript protein
15  :ensembl void:classPartition :ensemblGenePartition ;
16      void:propertyPartition :ensemblGeneTranscriptPartition ;
17      void:propertyPartition :ensemblTranscriptProteinPartition .
18  :ensemblGenePartition void:class so:gene .
19  :ensemblGeneTranscriptPartition void:property so:transcribed_from .
20  :ensemblTranscriptProteinPartition void:property so:translates_to .
21
22  # define the linkset that links the two partitions,
23  # including the dataset description that contains the triples that
24  # are used to derive this linkset
25  :ensemblGeneToUniprotProteinLinkset a void:Linkset ;
26      void:linkPredicate skos:related ;
27      void:subjectsTarget :ensemblGenePartition ;
28      void:objectsTarget :UPpartition ;
29      prov:wasDerivedFrom :ensemblGeneTranscriptPartition ,
30          :ensemblTranscriptProteinPartition ,
31          :ensemblProteinToUniprotProteinLinkset .

The VoID vocabulary of linked datasets allows the description of RDF links between datasets using VoID linksets. A linkset allows us to describe the links, captured as a set of triples, between two datasets. We can use VoID to describe relevant partitions of the datasets based on individual properties or classes; these form new subsets that can participate in multiple linksets. In our scenario we need to capture two crucial linksets: the first is the EP to UP linkset, and the second is the more convenient EG to UP linkset.
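As a rough illustration of what class-based and property-based partitions select, the following Python sketch filters a small set of triples. It is only a sketch under stated assumptions: triples are plain string tuples, the `ens:` identifiers are invented for the example, and the helper names `class_partition` and `property_partition` are hypothetical rather than part of VoID or of any library.

```python
# Triples are plain (subject, predicate, object) string tuples.
# All ens: identifiers are invented for illustration only.
RDF_TYPE = "rdf:type"


def class_partition(triples, cls):
    """Triples describing entities typed as cls (cf. void:classPartition)."""
    members = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    return [t for t in triples if t[0] in members]


def property_partition(triples, prop):
    """Triples that use the predicate prop (cf. void:propertyPartition)."""
    return [t for t in triples if t[1] == prop]


triples = [
    ("ens:GENE_A", RDF_TYPE, "so:gene"),
    ("ens:TX_A", RDF_TYPE, "so:transcript"),
    ("ens:TX_A", "so:transcribed_from", "ens:GENE_A"),
    ("ens:TX_A", "so:translates_to", "ens:PROT_A"),
    ("ens:PROT_A", RDF_TYPE, "so:Polypeptide"),
]

gene_partition = class_partition(triples, "so:gene")
gene_transcript_partition = property_partition(triples, "so:transcribed_from")
```

Because a partition is just a subset of the dataset, the same subset can be reused as the subject or object target of more than one linkset.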

The EP-UP linkset captures the :ep2upRelation link between types of EP in the Ensembl dataset, and types of UP in the UniProt dataset (lines 10-11). We describe two further subsets; the EP partition of all entities that are of type so:Polypeptide in the Ensembl dataset (lines 2-3) and the UniProt subset of all entities that are of type uniprot:Protein (lines 6-7).

The EG to UP link of convenience needs a similar linkset description based on an EG partition and the previous UP partition, although this time the relation is skos:related (lines 25-26). We also want to capture that the triples in this linkset are derived from another set of triples, that is, that skos:related is a shortcut relation for a more complex path through the RDF graph. Again we can use VoID partitioning, but this time using a property-based partition to identify the EG to Ensembl Transcript (ET) and ET to EP links (lines 15-20). Finally we use the prov:wasDerivedFrom relation to link the convenience linkset to the linksets that describe the full path of relations that the shortcut represents (lines 28-30).

4 Discussion

It is always important to try to model your data as accurately as possible, and publishing data with RDF and OWL is well suited to this task. The VoID vocabulary already provides a mechanism to define and attach provenance to linksets between datasets, and we are proposing the use of PROV to connect linksets that are derived from other linksets. As a Web of linked biological data emerges, there is a need to identify links that are there for convenience, and to expose how they relate back to the core biological (OWL) model. In cases where a link of convenience is derived from a series of other linksets, it is desirable to be able to spot this and unpack the convenience links using common queries. The model proposed supports this task, but questions remain as to whether VoID and PROV are enough, so we hope this preliminary work can help motivate the discussion.
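To make the idea of deriving and later unpacking a link of convenience concrete, the sketch below follows the gene to transcript to protein to UniProt path over plain tuples and keeps the supporting triples alongside each skos:related shortcut. It is an illustrative sketch, not the procedure used by Ensembl or UniProt. It assumes that the transcript is the subject of both so:transcribed_from and so:translates_to, and the function name `derive_convenience_links` and all identifiers are hypothetical.

```python
def derive_convenience_links(gene_transcript, transcript_protein, protein_links):
    """Derive gene -> UniProt shortcuts and keep the path behind each one.

    gene_transcript:    (transcript, "so:transcribed_from", gene) triples
    transcript_protein: (transcript, "so:translates_to", ensembl_protein) triples
    protein_links:      (ensembl_protein, ":ep2upRelation", uniprot_protein) triples
    Returns a dict mapping each (gene, "skos:related", uniprot_protein)
    triple to the list of triple paths it was derived from.
    """
    # Index the two later hops by their subject for quick lookup.
    by_transcript = {}
    for triple in transcript_protein:
        by_transcript.setdefault(triple[0], []).append(triple)
    by_protein = {}
    for triple in protein_links:
        by_protein.setdefault(triple[0], []).append(triple)

    derived = {}
    for first in gene_transcript:
        transcript, _, gene = first
        for second in by_transcript.get(transcript, []):
            for third in by_protein.get(second[2], []):
                shortcut = (gene, "skos:related", third[2])
                # The path plays the role of prov:wasDerivedFrom for this link.
                derived.setdefault(shortcut, []).append([first, second, third])
    return derived


# Invented example identifiers, not real database records.
gene_transcript = [("ens:TX_A", "so:transcribed_from", "ens:GENE_A")]
transcript_protein = [("ens:TX_A", "so:translates_to", "ens:PROT_A")]
protein_links = [("ens:PROT_A", ":ep2upRelation", "uniprot:UP_A")]

links = derive_convenience_links(gene_transcript, transcript_protein, protein_links)
```

Unpacking a shortcut is then a dictionary lookup: the value stored for a skos:related triple lists the semantically richer triples that justify it.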

Acknowledgements

EBI contribution supported by EU FP7 BioMedBridges Grant 284209.

References

1. Gray, A.J.G., Groth, P., Loizou, A., Askjaer, S., Brenninkmeijer, C.Y.A., Burger, K., Chichester, C., Evelo, C.T., Goble, C.A., Harland, L., Pettifer, S., Thompson, M., Waagmeester, A., Williams, A.J.: Applying linked data approaches to pharmacology: Architectural decisions and implementation. Semant. Web 5 (2014) 101–113

2. Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing Linked Datasets with the VoID Vocabulary. Note, W3C (March 2011)

3. Lebo, T., Sahoo, S.S., McGuinness, D.: PROV-O: The PROV Ontology. Technical report, W3C Recommendation (2013) http://www.w3.org/TR/prov-o/.

4. Flicek, P., Amode, M.R., Barrell, D., et al: Ensembl 2014. Nucleic acids research 42 (2014) D749–D755 doi: 10.1093/nar/gkt1196.

5. The UniProt Consortium: Activities at the universal protein resource (UniProt). Nucleic acids research 42 (2014) D191–D198 doi: 10.1093/nar/gkt1140.

Connecting Science Data Using Semantics and Information Extraction

Evan W. Patton and Deborah L. McGuinness

Rensselaer Polytechnic Institute 110 8th Street, Troy, NY 12180 USA

{pattoe, dlm}@cs.rpi.edu

Abstract. We are developing prototypes that explicate our vision of connecting personal medical data to scientific literature as well as to emerging grey literature (e.g., community forums) to help people find and understand information relevant to complex medical journeys. We focus on robust combinations of natural language processing along with linked data and knowledge representation to build knowledge graphs that help people make sense of current conditions and enable new manners of scientific hypothesis generation. We present our work in the context of a breast cancer use case. We discuss the benefits of biomedical linked data resources and describe some potential assistive technology for navigating rich, diverse medical content.

Keywords: knowledge representation, explanation, clinical notes, natural language, web forums, nanopublications

1 Introduction

As scientific knowledge continues to grow in size and diversity, it is increasingly difficult to discover and manage information relevant to any particular context. It can be challenging to determine how a statement or report relates to others and to form and evaluate (often competing) hypotheses, e.g. related to diagnosis or treatment paths. Complications grow when content is both structured and unstructured, and when some is from less accredited sources. We aim to expand the boundaries of Linked Science by focusing on evidence modeling from natural language processing (NLP) techniques over broad content and by identifying promising data-driven hypotheses using linked data and nanopublication-style encodings. We present this discussion in the context of a breast cancer demonstration use case informed by challenges experienced during a co-author’s recent cancer journey. Cancer is a complex disease to manage and treat, often requiring chemotherapy, surgery, radiation, and drugs to reduce recurrence. We show how management of this information by the patient is aided by semantic technologies combined with natural language processing algorithms.

A breast cancer patient wishes to better understand her diagnosis and planned treatment. She is interested in the expected chemotherapy side effects and in leveraging the experiences of similar individuals to proactively find and evaluate promising coping strategies. She reads through oncologist-provided documents about her proposed chemotherapy drugs and uses search engines to find out more about likely adverse effects that appear detrimental to her quality of life. She finds conflicting opinions on the efficacy of different coping strategies, and needs to determine an approach to effectively weigh the possible pros and cons. Managing this information is mentally taxing and can easily overwhelm a patient.

Our patient needs to find and comprehend potentially conflicting evidence about treatment options and side effects. We propose new software, using a variety of artificial intelligence tools built on the interoperability principles promulgated by linked data and the Semantic Web, to address these challenges.

2

The patient uses current technologies to obtain information about her treatment strategy and to formulate promising side effect mitigations. This can be time consuming for anyone, but more so for medically naïve patients. Furthermore, technologies such as web forums or social networking sites are becoming increasingly common for discourse between patients, as they often include anecdotal reports that have not yet been validated through clinical trials but may be valuable. They are often presented in layperson terms and sometimes attract new patients who may be less medically literate. Due to lack of scientific rigor, there may be contradictory or unsupported information available, as shown in the following two answers about a mitigation for the very common, taxol-related, nail bed problem:

“My onc[ology] nurse told me to rub tea tree oil into my cuticles and nails every night. It is a natural anti-septic and for whatever reason can sometimes help prevent nail infections and lifting during taxol.”¹

“I wouldn’t use tea tree oil. A friend did on some cracked skin and it got worse.”²

The first suggestion is a common preventive approach for nail problems: it claims that tea tree oil prevents nail infections because “it is a natural anti-septic” and appeals to authority (“my onc nurse told me to...”). The second suggestion, from a different user in the same thread, advises against tea tree oil as “a friend [applied tea tree oil] on some cracked skin and it got worse.” Natural language techniques may be used to extract coping strategies for particular conditions, but without deeper knowledge, provenance, and tools, the user may not know how to evaluate and/or integrate potentially contradictory suggestions. We are extending joint extraction techniques proposed in [4] with semantic background knowledge to aid in extracting linked data from medical records.

The Repurposing Drugs using Semantics (ReDrugS) project [5] has focused on modeling evidence using small units of publishable information called Nanopublications [2]. ReDrugS utilizes linked data sources to build a knowledge base of nanopublications that is then reasoned about using probabilistic techniques to identify potential links between proteins, drugs, binding sites, and genes, with the ultimate aim of discovering possible new off-label uses for FDA-approved drugs. This project’s success has been partially due to the large corpus of linked data and ontologies generated by the biomedical community over the past few decades. ReDrugS has ingested content from 17 structured curated data sources, including content concerning drugs, alternate names, conditions, and pathways. Once a chemotherapy protocol is extracted from medical notes, ReDrugS can be used to find alternative drug names along with related conditions. This framework, along with the side effect resource SIDER (in process), can be used to improve the patient’s process of finding chemotherapy drug side effects and some mitigations by applying its search techniques to authoritative drug resources, such as looking for anti-nausea prescription drugs. The infrastructure for this system could be repurposed for other scientific domains, but only if linked data sources are abundant in those domains or if quality linked data can be generated by automated methods, e.g. via natural language processing of web-based resources.

Explanations

We aim to provide extensive explanation mechanisms since explanation is a key component of transparent systems and user studies have shown that explanations are required if agents are to be trusted [1]. We aid explanation generation through the collection of provenance, modeled using the W3C’s PROV ontology [3]. PROV-O is a standard for modeling provenance information on the web, which allows tools to integrate distributed provenance information from different systems. We use this provenance to help construct end user explanations that include both lineage of content and support (and opposition) for a statement.

We identify potential evidence on the use of tea tree oil in chemotherapy-induced nail bed problems. Not only would a patient want to know the evidence, source, and authoritativeness for both views, she might also want the system to further decompose these arguments and present supporting evidence as to the antimicrobial nature of tea tree oil in more authoritative sources (e.g. [6]).
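One way to picture an explanation that separates support from opposition is the small Python sketch below. It is a hypothetical data structure for illustration, not the ReDrugS or PROV-O data model: the `Evidence` fields, the `explain` function, the wording of the claim, and the `authoritative` flags assigned to each item are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    claim: str           # statement the evidence bears on
    stance: str          # "supports" or "opposes"
    source: str          # where the statement was found
    attributed_to: str   # who the source attributes it to
    authoritative: bool  # illustrative flag; how to assess it is out of scope


def explain(claim, evidence):
    """Group the evidence for one claim into supporting and opposing entries."""
    grouped = {"supports": [], "opposes": []}
    for item in evidence:
        if item.claim == claim:
            label = "authoritative" if item.authoritative else "anecdotal"
            grouped[item.stance].append(
                f"{item.source} ({item.attributed_to}, {label})"
            )
    return grouped


NAIL_CLAIM = "tea tree oil helps prevent nail problems during taxol"
ANTISEPTIC_CLAIM = "tea tree oil is antimicrobial"

evidence = [
    Evidence(NAIL_CLAIM, "supports", "forum answer 1", "oncology nurse", False),
    Evidence(NAIL_CLAIM, "opposes", "forum answer 2", "a friend's experience", False),
    Evidence(ANTISEPTIC_CLAIM, "supports", "review article [6]", "Pazyar et al.", True),
]

nail_explanation = explain(NAIL_CLAIM, evidence)
antiseptic_explanation = explain(ANTISEPTIC_CLAIM, evidence)
```

Keeping the antimicrobial sub-claim as a separate entry mirrors the decomposition described above: the forum answers bear on the nail problem directly, while the review article only backs the underlying rationale.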

We claim that we can reuse the ReDrugS content to find prescription drugs for chemotherapy side effects. Provenance may be displayed to show that the recommendation is from a validated authoritative source. While that framework was originally designed to find potential new off-label uses for drugs along with confidence ratings, the explanation component is more critical for our use so that researchers may inspect evidence sources and the methods used to determine the system confidence. Without such explanations, people would have difficulty evaluating competing suggestions.

Natural Language Processing can expose some of the unstructured content of medical records as structured content as well as assist in generating linked data from unstructured sources. The ReDrugS framework provides a semantically integrated system combining many different structured biomedical resources to generate a broadly reusable knowledge graph. By integrating the natural language and structured knowledge representation approaches, we can obtain a much richer annotated knowledge base that includes source and confidence information. Our prototypes demonstrate some ways that this rich resource may then be used to help patients and their support networks to discover, integrate, and evaluate information relevant to complicated medical situations and to help form transparent and data-driven hypotheses about how to proceed. We believe these efforts demonstrate some opportunities for future AI-enhanced Linked Science-based assistants that use the wealth of structured content as well as the growing grey literature collection.

Acknowledgements

The authors thank Heng Ji and Alex Borgida for their discussions that helped shape this work.

References

1. Glass, A., McGuinness, D.L., Wolverton, M.: Toward establishing trust in adaptive agents. In: 13th Intl Conference on Intelligent User Interfaces. pp. 227-236 (2008)

2. Groth, P., Gibson, A., Velterop, J.: The anatomy of a nanopublication. Information Services & Use 30, 51-56 (2010)

3. Lebo, T., Sahoo, S., McGuinness, D.L.: PROV-O: The PROV ontology. Tech. rep., W3C (2013)

4. Li, Q., Ji, H.: Incremental joint extraction of entity mentions and relations. In: Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics (2014)

5. McCusker, J., Solanki, K., Chang, C., Dumontier, M., Dordick, J., McGuinness, D.L.: A nanopublication framework for systems biology and drug repurposing. In: CSHALS 2014 (2014)

6. Pazyar, N., Yaghoobi, R., Bagherani, N., Kazerouni, A.: A review of applications of tea tree oil in dermatology. International Journal of Dermatology pp. 784-90 (2013)