<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Drug Discovery and Big Linked Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ronald Siebes</string-name>
          <email>rm.siebes@few.vu.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Victor de Boer</string-name>
          <email>v.de.boer@vu.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bryn Williams-Jones</string-name>
          <email>bryn@openphactsfoundation.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stian Soiland-Reyes</string-name>
          <email>soiland-reyes@cs.manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>The Open PHACTS Drug Discovery Platform</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Open PHACTS Foundation</institution>
          ,
          <addr-line>Cambridge</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science, University of Manchester</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>VU University Amsterdam</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A large part of the daily practice of a researcher doing in vitro Drug Discovery is comparing and manually matching high-quality information from multiple disciplines in the Life and Biomedical Sciences. The Open PHACTS Discovery Platform<sup>4</sup> is an initiative to integrate publicly available data relevant to both academia and the pharmaceutical industry. It integrates numerous datasets, including ChEBI, ChemSpider, DrugBank and the Gene Ontology. The platform provides an easy interface that lets researchers consult the database without being confronted with the complexity of defining efficient Linked Data queries. A set of services is accessible via a RESTful interface. The Open PHACTS Discovery Platform provides an interpretation of biomedical research activities (identified by domain experts) as workflows that are authored using visual tools. Workflows retrieve data via API calls. The platform executes the resulting instantiated queries at an endpoint that serves the relevant data. Currently, the infrastructure uses commercial software to reason over the vast amount of RDF data; the Big Data Europe (BDE) project took up the challenge of providing the same functionality with open source Big Data technology.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>The Big Data Europe infrastructure</title>
      <p>The BDE project<sup>5</sup> is developing a re-usable Big Data infrastructure (BDI)
needed by data-intensive science practitioners tackling a wide range of
societal challenges. The infrastructure is designed to cover aspects of publishing
and consuming semantically interoperable, large-scale, multilingual data assets
and knowledge. It is also designed to minimize the disruption
of current workflows and to maximize opportunities by taking advantage of
the latest European RTD developments, including multilingual data harvesting,
data analytics, and data visualization. To test the effectiveness of the platform,
multiple pilot implementations are being developed in various domains. The first of
these pilots is the Drug Discovery Pilot implementation, which replicates much
of the functionality of the Open PHACTS platform. The infrastructure relies
heavily on Docker containers<sup>6</sup> and on configuration via Docker Compose, where
generic third-party Docker containers (e.g. Memcached, MySQL, Spark, HDFS)
are combined with custom-made, pilot-specific Docker containers.</p>
      <p><sup>4</sup> http://www.openphactsfoundation.org</p>
      <p><sup>5</sup> https://www.big-data-europe.eu</p>
    </sec>
    <sec id="sec-2">
      <title>Drug Discovery Pilot implementation</title>
      <p>In the pilot we propose to demonstrate<sup>7</sup>, the Open PHACTS functionality is
implemented on the BDI. One goal of this pilot is to investigate how to deal with
the significant diversity of the entity name space in the biomedical domain and
to explore how this issue affects a generic big data infrastructure. Mapping this
vast number of entities leads to a significant increase in the number of triples. A second goal is
to cover data and query security and privacy requirements and to explore how
the methods used to handle these in the current implementation of the Open
PHACTS Discovery Platform can guide development of the generic
BDE platform. An important challenge for this pilot is to replace the commercial
clustered RDF store with an open source alternative: 4store. To this
end, we are implementing a 4store BDE Docker component and improving it so
that it can serve as a generic component of the BDE infrastructure<sup>8</sup>.</p>
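      <p>To make the growth in triples concrete, the following minimal Python sketch materializes one directed mapping triple for every ordered pair of equivalent identifiers. It assumes that mappings are stored as pairwise skos:exactMatch links; the function name mapping_triples and the example.org identifiers are hypothetical and are not taken from the Open PHACTS linksets.</p>
      <preformat><![CDATA[
```python
from itertools import permutations

# Assumption: equivalence between identifiers is expressed with skos:exactMatch.
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"


def mapping_triples(equivalent_ids):
    """Return one directed mapping triple per ordered pair of identifiers."""
    return [
        (subject, SKOS_EXACT_MATCH, obj)
        for subject, obj in permutations(equivalent_ids, 2)
    ]


# Hypothetical identifiers for the same compound in three datasets.
same_compound = [
    "http://example.org/chebi/1",
    "http://example.org/chemspider/1",
    "http://example.org/drugbank/1",
]

triples = mapping_triples(same_compound)
print(len(triples))  # 6
```
]]></preformat>
      <p>Under this assumption, n equivalent identifiers yield n(n-1) mapping triples: three identifiers give 6 triples and four give 12, before any data about the entity itself is stored.</p>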
      <p>The pilot integrates multiple datasets, available in RDF. The mappings
between the identifiers used in the various datasets are freely available as RDF
linksets<sup>9</sup>. Most datasets have a metadata description published in VoID. The
functionality of the Open PHACTS services is described in Swagger. The
following processing is carried out:</p>
      <list list-type="bullet">
        <list-item>
          <p>Real-time processing: an external service (such as the Scientific Lenses
keyword expansion service) processes a query, and the processed query is then
executed on the data stored in the infrastructure.</p>
        </list-item>
        <list-item>
          <p>Batch processing: data transformations align and link datasets at
ingestion time. The datasets above are regularly updated and must be periodically
re-ingested.</p>
        </list-item>
      </list>
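      <p>The real-time step above can be sketched as follows. This is only an illustration under stated assumptions: expand_identifier is a hypothetical local stand-in for the external expansion service, the linkset is modelled as an in-memory dictionary, the example.org URIs are placeholders, and the target endpoint is assumed to accept the SPARQL 1.1 VALUES clause. The sketch expands one identifier into its mapped equivalents and builds a query that matches any of them.</p>
      <preformat><![CDATA[
```python
def expand_identifier(uri, linkset):
    """Hypothetical stand-in for the external expansion service.

    Returns the given URI followed by its mapped equivalents in sorted order.
    """
    return [uri] + sorted(linkset.get(uri, ()))


def build_query(uris):
    """Build a SPARQL query that matches triples about any of the given URIs."""
    values = " ".join(f"<{uri}>" for uri in uris)
    return "SELECT ?s ?p ?o WHERE { VALUES ?s { " + values + " } ?s ?p ?o . }"


# Placeholder linkset: one compound known under two other identifiers.
linkset = {
    "http://example.org/chebi/1": {
        "http://example.org/drugbank/1",
        "http://example.org/chemspider/1",
    },
}

expanded = expand_identifier("http://example.org/chebi/1", linkset)
query = build_query(expanded)
print(query)
```
]]></preformat>
      <p>Keeping the expansion separate from query construction mirrors the two stages described above: the query is processed first and only then executed against the stored data.</p>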
      <p>The pilot implementation exposes a querying endpoint as well as a data ingestion
endpoint for visualization or further processing.</p>
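      <p>As a sketch of how a client might address the querying endpoint, the Python snippet below prepares (but does not send) an HTTP GET request in the style of the SPARQL 1.1 Protocol. The endpoint URL is a hypothetical local address, not the pilot's documented one, and the JSON results media type is an assumption about what the deployed store can return.</p>
      <preformat><![CDATA[
```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint address; replace with the address of the deployed pilot.
ENDPOINT = "http://localhost:8000/sparql/"


def sparql_request(endpoint, query):
    """Prepare a GET request that carries the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Assumption: the store can serialize results as SPARQL JSON.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = sparql_request(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(request.get_method(), request.full_url)
```
]]></preformat>
      <p>Passing the prepared request to urllib.request.urlopen would send it; that call is left out here so the sketch runs without a live endpoint.</p>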
      <p>The pilot itself is available in its entirety as open source software<sup>10</sup>. Both
the BDI and the pilot-specific components are implemented as Docker components.</p>
      <p>Acknowledgements. This work is supported by the European Union's Horizon 2020
research and innovation programme under grant agreement No 644564
(www.bigdata-europe.eu). We thank our BDE collaborators for their support.</p>
      <p><sup>6</sup> https://www.docker.com/what-docker</p>
      <p><sup>7</sup> https://github.com/big-data-europe/pilot-sc1-cycle1</p>
      <p><sup>8</sup> https://github.com/big-data-europe/docker-4store</p>
      <p><sup>9</sup> https://www.openphacts.org/2/sci/data.html</p>
      <p><sup>10</sup> Download and instructions at https://github.com/big-data-europe/pilot-sc1-cycle1</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>