Drug Discovery and Big Linked Data

       Ronald Siebes1 , Victor de Boer1 , Bryn Williams-Jones2 , and Stian
                                 Soiland-Reyes3
                   1
                     VU University Amsterdam, the Netherlands,
                     rm.siebes@few.vu.nl, v.de.boer@vu.nl
            2
               Open PHACTS Foundation, Cambridge, United Kingdom
                        bryn@openphactsfoundation.org
      3
        School of Computer Science, University of Manchester, United Kingdom
                      soiland-reyes@cs.manchester.ac.uk


1     The Open PHACTS Drug Discovery Platform

A large part of the daily practice of a researcher doing in vitro Drug Discov-
ery is comparing and manually matching high-quality information from multiple
disciplines in the Life and Biomedical Sciences. The Open PHACTS Discovery
Platform4 is an initiative to integrate publicly available data relevant for both
academia and the pharmaceutical industry. It integrates numerous datasets in-
cluding for example ChEBI, ChemSpider, DrugBank and the GeneOntology.
The platform provides an easy interface that allows researchers to consult the
database without being confronted with the complexity of defining efficient
Linked Data queries. A set of services are accessible via a RESTful interface.
    The Open PHACTS Discovery Platform provides an interpretation of biomed-
ical research activities (identified by domain experts) as workflows that are au-
thored using visual tools. Workflows retrieve data via API calls. The platform
executes the resulting instantiated queries at an endpoint that serves relevant
data.Currently, the infrastructure uses commercial software to reason over the
vast amount of RDF data and the Big Data Europe (BDE) project took up the
challenge to get the same functionality but via open source Big Data technology.


2     The Big Data Europe infrastructure

The BDE project5 is developing a re-usable Big Data infrastructure (BDI)
needed by data-intensive science practitioners tackling a wide range of soci-
etal challenges. The infrastructure is designed to cover aspects of publishing
and consuming semantically interoperable, large-scale, multi-lingual data assets
and knowledge. This BDE infrastructure is designed to minimize the disruption
of current workflows, and maximizes the opportunities by taking advantage of
the latest European RTD developments, including multilingual data harvesting,
data analytics, and data visualization. To test the effectiveness of the platform,
4
    http://www.openphactsfoundation.org
5
    https://www.big-data-europe.eu
multiple pilot implementations are developed in the various domains. The first of
these pilots is the Drug Discovery Pilot implementation, which replicates much
of the functionalities of the Open PHACTS platform. The infrastructure relies
heavily on the Docker containers6 and configuration via Docker Compose where
generic 3rd party Docker containers (e.g. MemCached, MySQL, SPARK, HDFS)
are combined with custom made pilot specific Docker containers.

3    Drug Discovery Pilot implementation
In the pilot we propose to demonstrate7 , the Open PHACTS functionality is
implemented on the BDI. One goal of this pilot is to investigate dealing with
the significant diversity of the entity name space in the bio-medical domain and
exploring how this issue affects a generic big data infrastructure. Mapping this
vast amount of entities leads to a significant increase of triples. A second goal is
covering data and query security and privacy requirements and exploring how
the methods used to handle this in the current implementation of the Open
PHACTS Discovery Platform can be used to guide development of the generic
BDE platform. An important challenge for this pilot is to replace the commercial
cluster version RDF store, with an open source variant version: 4Store. To this
end, we are implementing a 4Store BDE docker component and improving it in
such a way that it can serve as a generic component on the BDE infrastructure8 .
    The pilot integrates multiple datasets, available in RDF. The mappings be-
tween the identifiers used in the various datasets are freely available as RDF
linksets9 . Most datasets have a metadata description published in VoID. The
functionality of the Open PHACTS services is described in SWAGGER. The
following processing is carried out:
 – Real time processing: Using an external service (such as the Scientific Lenses
   keyword expansion service) to process a query and then to execute the pro-
   cessed query on the data stored in the infrastructure.
 – Batch processing: Data transformations that align and link datasets at inges-
   tion time. The datasets above are regularly updated and must be periodically
   re-ingested.
The pilot implementation exposes a querying endpoint as well as a data ingestion
endpoint for visualization or further processing.
   The pilot itself is available in its entirety as Open Source software10 . Both
BDI and the pilot-specific components are implemented as Docker components.

Acknowledgements. This work is supported by European Union’s Horizon 2020
research and innovation programme under grant agreement No 644564 www.big-
data-europe.eu. We thank our BDE collaborators for their support.
6
   https://www.docker.com/what-docker
7
   https://github.com/big-data-europe/pilot-sc1-cycle1
 8
   https://github.com/big-data-europe/docker-4store
 9
   https://www.openphacts.org/2/sci/data.html
10
   Download and instructions at https://github.com/big-data-europe/pilot-sc1-cycle1