The Pharmacology Workspace: A Platform for Drug Discovery

EXTENDED ABSTRACT

The investigation and development of new drugs requires that scientists involved in the process deal with multiple information sources. These range from online databases of proteins (e.g. UniProt and Enzyme) and chemicals (e.g. ChEMBL, ChemSpider, and DrugBank), to models of biological pathways (e.g. Reactome, WikiPathways, and KEGG) and scientific literature. These information sources are often held in different formats and sourced from a wide variety of organizations. Together they cover a wide area of the scientific space of interest, but overlap in the data they provide and also record different (or even inconsistent) representations of the same data.

A significant challenge for scientists is the labour-intensive integration of datasets. The entities of interest must be identified and mapped to each other so that complementary information from many data sources can be collated in a single record. For example, ChemSpider contains data about chemical compounds and where they can be sourced, ChEMBL complements this with data about the bioactivity of drug-like molecules, and DrugBank provides information on the clinical use of drugs that contain the molecules. These data sources can be linked based on the chemical structure of the compounds. However, differences in scientific or technical approaches to molecular structure representation mean that different data sources will not always agree, often varying in the charged state of the compound, e.g. "Simvastatin" on ChemSpider and DrugBank. Thus, for successful data integration one must devise strategies that address inconsistencies within the existing data.

The linked data platform being developed in the Open PHACTS project aims to overcome these data integration challenges. There are two key entry points into the system, both of which perform resolution from user input to an identifier for a data concept.

The first is through keyword search, as shown in Figure 1. In the pharmacology domain, this is more than just text matching, as a keyword can often match multiple, very distinct concepts. For example, when typing "menthol", does the user mean the chemical menthol or the menthol receptor protein? The user interface supports this disambiguation by providing different entry points, e.g. compound by name or target by name (shown in Figure 1). The Identifier Resolution Service (IRS) translates user-entered entity names (in free text form), together with the context information, into known entities within the system (i.e. entities that have a defined URI). The IRS uses several dictionaries, including a custom dictionary of chemical names and synonyms from ChemSpider, as well as MeSH, GO, and SwissProt. The IRS provides data for the auto-complete text box, including the preferred name for the entity and a link to its definition. This supports the user in disambiguating the entity that they mean. The identified entity URI can then be used to retrieve further information from the linked data platform.
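The context-sensitive resolution described above can be illustrated with a minimal Python sketch. It is not the IRS implementation: the dictionary entries, the example.org URIs, and the names Concept, DICTIONARY, SYNONYMS, and suggest are all hypothetical, and a simple prefix match stands in for whatever matching the real service performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    uri: str             # hypothetical URI, for illustration only
    preferred_name: str  # name shown in the auto-complete box
    kind: str            # context chosen by the user: "compound" or "target"


# Hypothetical entries; the real IRS draws on ChemSpider, MeSH, GO and SwissProt.
DICTIONARY = [
    Concept("http://example.org/compound/menthol", "Menthol", "compound"),
    Concept("http://example.org/target/menthol-receptor", "Menthol receptor", "target"),
    Concept("http://example.org/compound/aspirin", "Aspirin", "compound"),
]

# Hypothetical synonym table mapping a lower-cased synonym to a concept URI.
SYNONYMS = {
    "acetylsalicylic acid": "http://example.org/compound/aspirin",
}


def suggest(text, kind):
    """Return concepts of the given kind whose name or synonym starts with text."""
    needle = text.strip().lower()
    if not needle:
        return []
    by_uri = {concept.uri: concept for concept in DICTIONARY}
    hits = []
    # Match on preferred names first, restricted to the chosen context.
    for concept in DICTIONARY:
        if concept.kind == kind and concept.preferred_name.lower().startswith(needle):
            hits.append(concept)
    # Then match on synonyms, avoiding duplicate suggestions.
    for synonym, uri in SYNONYMS.items():
        concept = by_uri[uri]
        if concept.kind == kind and synonym.startswith(needle) and concept not in hits:
            hits.append(concept)
    return hits


if __name__ == "__main__":
    for context in ("compound", "target"):
        for concept in suggest("menthol", context):
            print(context, concept.preferred_name, concept.uri)
```

The same keyword resolves to a different URI depending on whether the compound or the target entry point is chosen, which is the disambiguation step the user interface relies on.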

The second entry point is through chemical structure search that uses a tool for drawing chemical structures which are then converted to a standardised chemical structure representation. This is then processed by the ChemSpider structure search service to return a ChemSpider URI for the chemical entity drawn. The service can also be used for substructure and similarity searches.

The linked data platform leverages the comprehensive work already performed by the community in creating RDF-based datasets that are relevant to the Open PHACTS project. The current platform uses the ChEMBL and ChEBI datasets provided by the Chem2Bio2RDF project (Chen et al., 2010), the conversion of DrugBank provided by the LODD project (Samwald et al., 2011), and the conversion of the Enzyme database sourced from UniProt (Jain et al., 2009). A significant challenge is ensuring that the RDF versions of the datasets are kept up to date with the originals from which they are derived. For example, the Chem2Bio2RDF version of ChEMBL is version 8, whereas the original dataset is now at version 13.

The data sources are integrated using parameterized SPARQL queries that are called through an API exposed by the linked data platform. The API call generates a query containing the URI returned by the IRS. The query is then expanded at execution time using an identity mapping service that equates the data entity URIs from the various data sources. To provide adequate interaction speeds, we have cached the datasets in the linked data platform.
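As a rough illustration of this query expansion, the Python sketch below fills a parameterized SPARQL template with a resolved URI and every URI equated with it. It makes several assumptions not stated in the text: the mapping table EQUIVALENTS, the example.org URIs, the function names, and the use of a SPARQL 1.1 VALUES clause are all hypothetical choices, and the platform's actual expansion mechanism may differ.

```python
# Hypothetical identity mappings; the real service equates URIs across data sources.
EQUIVALENTS = {
    "http://example.org/chemspider/aspirin": {
        "http://example.org/chembl/aspirin",
        "http://example.org/drugbank/aspirin",
    },
}

# Parameterized query: %(uris)s is replaced when the API call is handled.
QUERY_TEMPLATE = """\
SELECT ?source ?property ?value
WHERE {
  VALUES ?source { %(uris)s }
  ?source ?property ?value .
}"""

# Characters that must not appear inside a SPARQL IRI reference.
_FORBIDDEN = set('<>"{}|^`\\')


def check_uri(uri):
    """Reject strings that cannot be safely embedded as <uri> in a query."""
    if not uri or any(ch in _FORBIDDEN or ch.isspace() for ch in uri):
        raise ValueError(f"not a safe URI: {uri!r}")
    return uri


def equivalent_uris(uri):
    """Return the URI together with everything the mapping table equates with it."""
    return sorted({uri} | EQUIVALENTS.get(uri, set()))


def build_query(uri):
    """Expand the template with the resolved URI and all of its equivalents."""
    uris = " ".join(f"<{check_uri(u)}>" for u in equivalent_uris(uri))
    return QUERY_TEMPLATE % {"uris": uris}


if __name__ == "__main__":
    print(build_query("http://example.org/chemspider/aspirin"))
```

Expanding the query rather than rewriting the stored data leaves each source dataset untouched, so the mappings can be revised without reloading the cached datasets.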

The result of a compound lookup with the search term "Aspirin" is shown in Figure 2. Information about the chemical structure is sourced from ChemSpider, details of its bioactivity are obtained from ChEMBL, and information about the drugs in which the compound is active is obtained from DrugBank. Currently, the provenance of the data points is not shown in the user interface, although this is planned for the public release.

The linked data platform is being developed to answer a set of pharmacology research questions that require data to be integrated from a variety of data sources (Williams et al., 2012). The platform hides the complexities of interacting with the linked data and concepts by exposing an API that provides the core functionality to support a wide variety of drug discovery applications being developed within the Open PHACTS project, although only one has been shown in this demonstration paper.