<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Biological Web Services: Integration, Optimization, and Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Benedikt</string-name>
          <email>michael.benedikt@cs.ox.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rodrigo Lopez-Serrano</string-name>
          <email>rls@ebi.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Efthymia Tsamoura</string-name>
          <email>efthymia.tsamoura@cs.ox.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>European Bioinformatics Institute</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Oxford</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>A vast amount of biological data is now available via web services. Yet the usefulness of this data is limited by the difficulty in performing queries that require data spanning multiple services. We overview a platform which offers integrated data access with minimal user awareness. Users pose high-level queries to this platform, and the system applies a combination of reasoning techniques and cost-based optimization to generate an efficient and reliable implementation on top of the services. We briefly explain the platform's reasoning paradigm, which is based on exact reformulation; we then overview the platform's use on a set of bioinformatics sources, and provide some preliminary results concerning its performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>As in most areas of modern society, biology has seen an
explosion in the quantity and variety of available data. On the
web there are an enormous number of resources providing
access to datasets of biological interest; hundreds of
overlapping bioinformatics data resources had already been listed
by 2013. The predominant means for exposing this data is
via web services, generally those abiding by the REST
(Representational State Transfer) paradigm. The existence of these
public interfaces to biological data can enable the building of
a wide variety of new applications, from custom data mining
tools to new search interfaces. While many of these new data
resources are useful to biologists in isolation, even more
applications are enabled when scientists combine information
from multiple web-based resources. For example, by
combining expression data with pathway information biologists
can analyse changes in metabolic and signalling processes in
cancer diseases, or understand protein disorders through
comparative genomics and genetic interactions.</p>
      <p>Unfortunately it is difficult for biologists to exploit data
from these resources. The tools available to them today
include hand-crafted scripts (e.g. in Python or Perl) and
workflow management systems that allow scientists to glue
together modules in a component-based way. The key
problem with existing techniques for accessing web datasources is
that scientists are not insulated from the details of the data
resources. This lack of insulation makes it difficult for them to
create a process that answers a query by making use of the
services (below, we refer to such a process as a “plan”).</p>
      <p>Although bioinformatics resources can often be accessed
via RESTful services, users still have to grapple with a great
diversity in the underlying technologies exposed via the
services. Some services provide a thin wrapper on top of a
traditional database API. Other resources provide keyword-based
interfaces. Still others provide navigational interfaces. Many
of these interfaces impose additional access restrictions,
e.g. certain fields being required, or a limit on the number of
requests or the number of results that can be retrieved. Users may
not even be aware of the restrictions that are present, but they
can impact the results of service calls.</p>
      <p>Example 1. Alice is interested in publications that
reference compounds found in humans. She decides to pull
publication and compound data from the PubMed and ChEBI web
databases. PubMed provides a web service with an SQL-like
interface for accessing publications. Examples of
terms with which you can access PubMed publications
include the organisms or the compounds referenced in the
publication, the publication identifier and the publication year.
ChEBI provides two interfaces with which you can access
compounds: a text interface, which returns the identifiers of
the compounds that relate to the input text, and an
identifier-based interface, which returns all the data (e.g., the
molecular structure) associated with the input compound identifier.
The text interface restricts the number of returned compound
identifiers to 500.</p>
      <p>Plans that answer Alice’s query are not obvious. A correct plan
first calls the PubMed interface with input “Human” and
then calls the ChEBI identifier interface once for each
compound identifier found within the returned publication
entries. Starting in the opposite way (first calling ChEBI and then
PubMed) will only give an incomplete answer, because the ChEBI
text interface returns at most 500 identifiers.</p>
      <p>The diversity in interfaces may not only make it difficult to
come up with any correct plan, it may make it hard to get a
plan that performs well.</p>
      <p>Example 2. Alice is interested in 2015 publications that
reference bioassays. ChEMBL provides an interface with
which users can access all bioassays without providing any
input. It also allows users to look up a bioassay using
the PubMed identifier of the publication that references this
bioassay. One plan to answer the query is to call the
input-free interface of ChEMBL and then, for each returned
bioassay, to call the PubMed interface using the bioassay
publication identifier and the publication year (2015). This plan
will make 1,148,942 requests in total, the number of bioassays
in ChEMBL. Another possibility is to call the PubMed
interface with input 2015 and then, for each returned publication,
to call the ChEMBL interface with the returned
publication identifier as input. This plan will make at most 585,750 requests,
as there are 585,750 publications in PubMed
published in 2015 and not every 2015 publication references a
bioassay.</p>
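      <p>The comparison in Example 2 is plain arithmetic on the two request counts. The sketch below only restates it; the two totals are taken from the text, while the variable names are illustrative.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Request counts quoted in Example 2 (both figures come from the text).
chembl_bioassays = 1_148_942        # bioassays returned by the input-free ChEMBL access
pubmed_2015_publications = 585_750  # PubMed publications with year 2015

# Plan A: one PubMed lookup per ChEMBL bioassay.
plan_a_requests = chembl_bioassays
# Plan B: at most one ChEMBL lookup per 2015 publication.
plan_b_requests_upper_bound = pubmed_2015_publications

saved_requests = plan_a_requests - plan_b_requests_upper_bound
ratio = plan_a_requests / plan_b_requests_upper_bound

print(f"Plan B saves at least {saved_requests:,} requests; "
      f"plan A issues about {ratio:.2f} times as many.")
```
]]></preformat>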
      <p>Translating a user query into an efficient plan can certainly
be done manually. However, while traditional bioinformatics
resources hosted locally within a laboratory are quite stable,
the velocity of web-based bioinformatics data substantially
complicates the development of a plan. Not only do the
interface characteristics of these resources change over time, but
their performance characteristics change as well.</p>
      <p>The problem of data integration for biology is by no means
a new one; indeed, it has been the subject of intense
study within the artificial intelligence, data management, and
bioinformatics communities. But new technologies
make a fresh attack on the problem timely. First,
some of the protocol heterogeneity has lessened, as more and
more resources are available via the same protocol.
Secondly, much more meta-data is available, in the form of
biological ontologies, such as the Experimental Factor
Ontology (see www.ebi.ac.uk/efo) and the Gene Ontology
(see geneontology.org). Lastly, there has been much
progress in reasoning systems that can support declarative
integration.</p>
      <p>In this paper we consider an integration system that
addresses these challenges. FIBRes (Framework for Integrating
Biological REStful services) allows declarative access to a
number of biological resources available via web services. It
exposes a unified interface — i.e. a “global schema” — via
SQL, and implements user queries sent over the interface on
top of the services. It differs from prior systems for
biological data integration and for web service integration in both the
flexibility of its architecture and in the models of integration it
supports. The first distinction allows FIBRes to be used in the
presence of a variety of constraint and mapping languages.
The second distinction gives FIBRes advantages in
performing cost-based optimization of the corresponding middleware
plans. This is particularly relevant in the biological domain,
where distinct plans may vary dramatically in performance,
as our experiments show.</p>
      <p>Organization. Section 2 provides more context for our
work, including an overview of prior research. Section 3
explains the FIBRes system at a high level. Section 4 overviews
our prototype implementation on top of data from the
European Bioinformatics Institute (EBI), one of the leading public
providers of biological data, and provides some preliminary
experimental results. Section 5 discusses open issues and
ongoing work on the system.</p>
    </sec>
    <sec id="sec-2">
      <title>A brief history of biological data integration</title>
      <p>Integration of biological data has been an active research
topic for decades. An enormous body of research work
has appeared, and a number of excellent surveys are
available [Thiam Yui et al., 2011; Gomez-Cabrero et al., 2014;
Goble and Stevens, 2008; Paton, 2008; Hernandez and
Kambhampati, 2004; Lapatas et al., 2015]. We give a brief
overview of some major themes.</p>
      <p>Work in the area can be classified along a number of
dimensions, including:</p>
      <p>Source modelling.</p>
      <p>Many toolsets focus on sources providing uniform
interfaces in a particular data model, such as relational data
[Zhang et al., 2011] or nested relations [Davidson et al.,
2001]. Others employ a finer specification concerning
the querying capabilities of sources [Haas et al., 2001;
Kambhampati et al., 2004; Thakkar et al., 2005]. Query
capability specifications can include whether the source
provides keyed lookup facilities, full query language
access, or something in between.</p>
      <p>Procedural vs. declarative approaches.</p>
      <p>Some toolsets consist of procedural languages for
integration coupled with libraries for certain biological
tasks. Workflow systems, for example, provide either
explicit scripting languages or visual environments for
developing applications that access biological data. They
deal with the problem of building scientific applications
out of components. They are not concerned primarily
with querying data, but with allowing users to glue
together programs that process data. Taverna
[Wolstencroft et al., 2013] is a general-purpose workflow
language that has been applied extensively within
biology. Galaxy [Goecks et al., 2010] is a workflow system
geared specifically towards biology.</p>
      <p>At the other extreme there are approaches that rely
throughout on declarative languages, both for end-user
access to the data and for specification of the
relationships between data items. A prime example of the
declarative approach was TAMBIS [Goble et al., 2001],
a long-running project for biological data integration
utilizing ontologies to express the global schema.</p>
      <p>Relationship of integrated schema and source schemas.
Within declarative approaches, a further distinction
concerns how much indirection is allowed between the
integrated schema and the backend sources. Some toolsets
focus primarily on a simple kind of federation, providing
a single point of access to multiple biological datasets,
allowing them to appear as a single database, but one
whose schema is simply the union of all tables in each
backend source. This insulates end users from dealing
with many of the issues in merging data (e.g. performing
joins across sources), but still requires them to understand
the details of individual schemas, along with
relationships between data in schemas. Bio-Kleisli [Davidson et
al., 2001] was an early system for federated access to
biological data that follows this model. Chem2Bio2RDF
[Chen et al., 2010] is a more recent system that
supports querying an interlinked model, but without
inferring query results using reasoning.</p>
      <p>Other tools expose a global schema that has a more
complex relationship with sources. Many tools support more
sophisticated relationships between global schema and stored data,
based on declarative mappings, that is,
logical constraints relating the global and local schemas. The
latter approach allows for much more insulation for
users, albeit at the cost of more complex database
administration, including formation and maintenance of
mappings.</p>
      <p>Target implementation and optimizations.</p>
      <p>Tools differ in the kind of “integration plans” they
generate and the optimizations that can be performed on plans.
DiscoveryLink [Haas et al., 2001] generated plans on
top of the Garlic middleware system, which came with
a sophisticated optimizer working within a plug-in
architecture. Other tools came with middleware targeting
particular classes of resources. For example, Thakkar
et al. [Thakkar et al., 2005] is one of the few papers
dealing specifically with efficiency issues in declarative
data integration on top of biological web services. They
make use of the common “certain answer semantics” for
querying global schemas defined by powerful
declarative mappings (see discussion below). Plans for
providing such answers will generally require recursion, and
[Thakkar et al., 2005] provides a streaming dataflow
engine that can handle recursive queries. They also
provide optimizations geared towards reducing the number
of web service calls.</p>
      <p>
        FIBRes in context. We now place our own work, FIBRes,
within these dimensions. In terms of data model we view
sources as relations equipped with a collection of access
methods
        <xref ref-type="bibr" rid="ref15 ref16 ref25 ref9">(in the same spirit as [Kambhampati et al., 2004; Thakkar
et al., 2005])</xref>
        , which expose look-up style interfaces to the
relations. We focus on purely declarative techniques, providing
SQL access on top of a global schema defined by logic-based
integrity constraints relating it to the local sources. Thus
our work is completely orthogonal to systems like Taverna or
Galaxy, or to declarative systems that do not use reasoning,
such as Chem2Bio2RDF. The major distinction of FIBRes
from prior work in the logic-based space is the semantics we
consider for implementing a query, and the ability (enabled
by our chosen semantics) to do cost-based optimization on
web-services. In terms of semantics, we focus on getting
exact reformulations of a user query Q: a plan PL making use
of the exposed access methods with the property that: for
every instance I for the global and local schemas together that
satisfies the mapping rules and constraints, PL run over the
backend sources in I gives the same result as Q evaluated on
I. Exact reformulations are available for access-determined
scenarios [Benedikt et al., 2016; 2015b]: those where the
information available from the interfaces determines the output
of the query. Access-determinacy always holds in the case
of “global-as-view” mappings, where the global schema is
defined by a set of queries over the sources: for
global-as-view, each instance of the sources defines a single instance of
the global schema (see Figure 2, left), and thus any query
over the global schema is access-determined. But
access-determinacy also holds in many other cases, such as when
the global schema has several alternative definitions based on
source information, or when the global schema loses
information, but that information does not impact the query (see
Figure 2, right). Thus the exact reformulation approach is
strictly more general than global-as-view. It is less general
than approaches such as “local-as-view” [Lenzerini, 2002]
which allow each instance of the local schema to correspond
to many possible instances of the global schema with
different query results, and define the answer to a query as being
the intersection over all results (the “certain answers”: see
Figure 2, middle). Thus the certain answer semantics is
well-defined even when access-determinacy fails. The advantage
of exact reformulations over this broader approach is that
exact reformulations are generally smaller in size (and in
runtime efficiency) and easier to optimize than plans that retrieve
the certain answers. For example, exact reformulations never
require recursion, and given that recursively probing web
services is extremely expensive, the ability to avoid recursion
is particularly significant in the domain of web service
integration. The use of exact reformulations has allowed us
to adapt cost-based optimization to logic-based data
integration of services, as we explain in Section 3. Further, the exact
reformulation approach does not require strong restrictions
on the constraint or mapping language for capturing the
relationships between and among global schema and sources
(in contrast, e.g. to work in Ontology Based Data Access).
To our knowledge, FIBRes is the only system for logic-based
relational and web-service data integration with cost-based
optimization.
      </p>
    </sec>
    <sec id="sec-3">
      <title>FIBRes architecture</title>
      <p>FIBRes is built on top of the PDQ system for data integration
[Benedikt et al., 2014; 2015a]. The system’s architecture is
depicted in Figure 1, and features a meta-data manager, a
planner, a runtime, and wrappers.</p>
      <p>Metadata. FIBRes requires metadata about both backend
sources and the integrated “global schema”. For the
backend sources, the metadata consists of the interfaces that each
source supports, specified as functions that take a fixed set of
key-values (the inputs of the interface) and return a list of
tuples. The idea is that by binding the inputs to values, the
web service can be invoked to get all matching tuples. We
say that the interface is accessed with that binding.</p>
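      <p>The notion of an interface accessed with a binding can be sketched as follows. This is a minimal illustration and not the FIBRes or PDQ API: the class name AccessMethod, its fields, and the in-memory publication data are all hypothetical stand-ins for a real web service call.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, ...]
Binding = Dict[str, str]


@dataclass(frozen=True)
class AccessMethod:
    """An interface over a relation: fixed input attributes, list-of-tuples output."""
    name: str
    inputs: Tuple[str, ...]               # attributes that must be bound
    call: Callable[[Binding], List[Row]]  # stands in for the web service request

    def access(self, binding: Binding) -> List[Row]:
        """Access the interface with a binding that covers exactly the inputs."""
        if set(binding) != set(self.inputs):
            raise ValueError(f"{self.name} requires inputs {self.inputs}")
        return self.call(binding)


# Hypothetical in-memory stand-in for a publication service keyed on year.
_PUBLICATIONS = [("pmid1", "2015"), ("pmid2", "2014"), ("pmid3", "2015")]

by_year = AccessMethod(
    name="publication_by_year",
    inputs=("year",),
    call=lambda b: [row for row in _PUBLICATIONS if row[1] == b["year"]],
)

# Accessing the interface with the binding year = 2015.
rows_2015 = by_year.access({"year": "2015"})
```
]]></preformat>
      <p>An input-free interface corresponds to an empty tuple of inputs and is accessed with the empty binding.</p>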
      <p>For the global schema, FIBRes requires a collection of
tables with their attributes and data types, and a collection
of integrity constraints. The constraints can include
mapping rules that relate backend sources to global schema
tables, as well as relationships and invariants that hold within
the global schema or among the sources. Although the
FIBRes architecture is quite flexible about the constraint
language, the current implementation of FIBRes supports only
dependencies: either tuple-generating dependencies (TGDs)
or equality-generating dependencies (EGDs) [Fagin et al.,
2005].</p>
      <p>[Figure 2. Panels: (a) Global as view; (b) General mappings; (c) Access-determined. Labels: Backend Sources; Possible Instance of Integrated Schema; Semantics; Query Output.]</p>
      <p>Queries and plans. The main function of FIBRes is to
take a user query and search for a plan that is an exact
reformulation of the query. User queries are specified as SQL
basic SELECT queries over the global schema, while plans
are a sequence of access commands and middleware data
manipulation commands. Commands refer to a set of temporary
tables that are maintained in the middleware. An access
command takes an interface and the contents of one temporary
table and performs a “bulk access”: accessing the interface
with every tuple in the table, putting the outputs in another
temporary table. A middleware data manipulation command
performs standard database operations (e.g. in relational
algebra) on temporary tables. Thus a join across sources S1
and S2 could be implemented by issuing access commands
on each source, putting the results in temporary tables T1 and
T2, and then performing a middleware command that joins T1
and T2.</p>
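      <p>A plan of this shape can be sketched in a few lines. The code below is an illustration under stated assumptions, not the FIBRes runtime: the function names access_command and join_command, the dictionary-based rows, and the two in-memory sources s1 and s2 are hypothetical. It performs a bulk access on S1, projects the join attribute to form the input table for S2, performs a second bulk access, and joins the two temporary tables in the middleware.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Table = List[Row]


def access_command(interface: Callable[[Row], Table], input_table: Table) -> Table:
    """Bulk access: call the interface once per input tuple and collect all outputs."""
    output: Table = []
    for binding in input_table:
        output.extend(interface(binding))
    return output


def join_command(left: Table, right: Table, on: str) -> Table:
    """Middleware hash join of two temporary tables on a shared attribute."""
    index: Dict[str, Table] = {}
    for right_row in right:
        index.setdefault(right_row[on], []).append(right_row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[on], [])
    ]


# Hypothetical stand-ins for two sources S1 and S2.
def s1(binding: Row) -> Table:
    data = [{"pmid": "p1", "year": "2015"}, {"pmid": "p2", "year": "2014"}]
    return [r for r in data if r["year"] == binding["year"]]


def s2(binding: Row) -> Table:
    data = [{"pmid": "p1", "assay": "a7"}, {"pmid": "p3", "assay": "a9"}]
    return [r for r in data if r["pmid"] == binding["pmid"]]


t1 = access_command(s1, [{"year": "2015"}])        # temporary table T1
s2_inputs = [{"pmid": r["pmid"]} for r in t1]      # middleware projection of T1
t2 = access_command(s2, s2_inputs)                 # temporary table T2
result = join_command(t1, t2, on="pmid")           # middleware join of T1 and T2
```
]]></preformat>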
      <p>Planner. The planner is the central object in the
architecture, taking a query and searching for an appropriate plan PL.
This is done by reasoning with the dependencies. For each
query Q and set of dependencies <inline-formula><tex-math>\Sigma</tex-math></inline-formula>, we can come up with a
query <inline-formula><tex-math>Q'</tex-math></inline-formula> and an additional set of constraints <inline-formula><tex-math>\Sigma'</tex-math></inline-formula> such that: Q
has an exact reformulation iff the entailment
<disp-formula><tex-math>Q \wedge \Sigma \models (\Sigma' \rightarrow Q')</tex-math></disp-formula>
holds. Informally, the entailment above represents a proof
that the information in the interfaces determines the output
of the query Q; a proof that we have the picture in Figure 2,
right, rather than that in Figure 2, center.</p>
      <p>Furthermore, for every proof of the entailment, one can
extract a corresponding plan that is an exact reformulation of
Q. The planning module searches for a proof. If no such
proof exists, this means that the query has no exact
reformulation. For example, in a variation of Example 1 in which
all of the interfaces can only return at most one tuple, Alice’s
query would not have any exact reformulation, and the
system will report this. If there are many proofs, then the planner
searches through all of them using the cost of the
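corresponding plan to guide the search. In the case where the constraints
consist of TGDs and EGDs, the same is true for <inline-formula><tex-math>\Sigma'</tex-math></inline-formula>, and
thus proofs of the entailment <inline-formula><tex-math>Q \wedge \Sigma \models (\Sigma' \rightarrow Q')</tex-math></inline-formula> can be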
generated using the well-known forward-chaining algorithm
known as the chase [Maier et al., 1979; Fagin et al., 2005;
Onet, 2013].</p>
      <p>The proof-to-plan module is in charge of building
execution plans from proofs, making calls to the reasoner for
consequence closure. The cost module evaluates and compares
the quality of our plans. Cost for access commands is
assumed to be proportional to the number of bindings to a given
access method, multiplied by a per-method constant. In the
absence of any information on the performance of methods,
each method is given the same per-method cost.</p>
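      <p>The access cost described above, bindings multiplied by a per-method constant and summed over the access commands of a plan, can be written down directly. The sketch below is illustrative only: the function access_cost, the plan summaries, the method names m1 to m4, and all numbers are hypothetical. Methods without a known constant fall back to a shared default, mirroring the uniform per-method cost used when no performance information is available.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Tuple

# A plan is summarised here as (method name, number of bindings) pairs.
PlanSummary = List[Tuple[str, int]]


def access_cost(plan: PlanSummary,
                per_method_cost: Dict[str, float],
                default_cost: float = 1.0) -> float:
    """Sum over access commands of (number of bindings) * (per-method constant)."""
    total = 0.0
    for method, num_bindings in plan:
        total += num_bindings * per_method_cost.get(method, default_cost)
    return total


# Two hypothetical plans for the same query.
plan_a: PlanSummary = [("m1", 1), ("m2", 1000)]
plan_b: PlanSummary = [("m3", 1), ("m4", 400)]

# With no performance information every method gets the same constant.
uniform_a = access_cost(plan_a, {})
uniform_b = access_cost(plan_b, {})

# If m4 is known to be three times as expensive per call, the ranking flips.
weighted_b = access_cost(plan_b, {"m4": 3.0})
cheapest = min([("plan_a", uniform_a), ("plan_b", weighted_b)], key=lambda p: p[1])[0]
```
]]></preformat>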
      <p>Above we have been very high-level. The details of the
transformation from <inline-formula><tex-math>Q, \Sigma</tex-math></inline-formula> to <inline-formula><tex-math>Q', \Sigma'</tex-math></inline-formula>, the plan-extraction
algorithm, and the proof calculus can be found in [Benedikt et
al., 2015b; 2016]. But for the purposes of what follows, the
important take-away is: we search through a space of proofs,
with each proof validating that we have sufficient accessible
data to answer the query; each proof corresponds to a path
of information requests, and thus in searching the proofs we
are searching through the corresponding sequences of access
methods that can answer the user query. As we search we use
cost to prune the space of alternative proofs and plans to
explore. In Example 2 one proof would correspond to the plan
first making input-free access to ChEMBL and then using the
results in PubMed, while a second proof would correspond
to the plan first accessing PubMed with input 2015 and using
the results to access ChEMBL.</p>
      <p>
        Cost. FIBRes supports a variety of cost functions. For
example, middleware operations can be estimated using
standard “textbook” formulas
        <xref ref-type="bibr" rid="ref23">(e.g. [Ramakrishnan and Gehrke,
2003], Chapters 12 and 14)</xref>
        in terms of an estimate on the
size of the output. In our prototype we only consider the cost
of access commands, ignoring the processing of data in
middleware. For each access command we estimate its cost based
on the average number of tuples returned, which is estimated
from sampling the data. In Example 2 we use our estimates of
the selectivity of accesses to ChEMBL and PubMed to
determine that the second plan does fewer web service calls, and
hence is to be preferred.
      </p>
      <p>Runtime. FIBRes has an execution environment for
evaluating plans. Operators are evaluated in pipelined fashion,
with intermediate results from accesses and relational
operators being buffered in the middleware. To deal with transient
web-service failures in access commands, the middleware has
a parameterized restart threshold, after which it re-initiates an
access by reconnecting to the service.</p>
      <p>Wrappers. The wrapper layer acts as an interface with the
services, containing any service-specific information
(destination urls, service parameters).</p>
    </sec>
    <sec id="sec-4">
      <title>Initial data sets and results</title>
      <p>Our initial dataset included data from a variety of biological
datasources, all obtained via the interfaces of the European
Bioinformatics Institute. We chose to integrate data from
these resources as they all support a RESTful access API and
expose relational data.</p>
      <p>[Figure 1. Component labels: Query; Reasoning module; Proof to plan module; Cost estimation module.]</p>
      <p>ChEMBL is a bioassay database [Bento et al., 2014].
The web services in ChEMBL implement SQL-like
access interfaces on top of which users can submit
conjunctive queries over a single table. We
considered the Activity, Assay, Document, Molecule,
Target and TargetComponent web services.
More details about ChEMBL web services can be found
under https://www.ebi.ac.uk/chembl/ws.</p>
      <p>UniProtKB [The UniProt Consortium, 2015] provides
functional information on proteins. It consists of
two different datasets, Swiss-Prot and TrEMBL, where
Swiss-Prot is a datasource of manually annotated
proteins, and TrEMBL is an automatically-annotated
supplement. The data in UniProtKB is available to users
both through services that allow key-based lookups
and through services that implement SQL-like
accesses. See http://www.uniprot.org/help/programmatic_access.</p>
      <p>Reactome [Milacic et al., 2012] is a pathway
database. We integrated the Reactome web
services frontPageItems, queryById and
speciesList (see http://reactomews.oicr.on.ca:8080/ReactomeRESTfulAPI/ReactomeRESTFulAPI.html).
frontPageItems takes as input a species name
and returns all the pathways that relate to this species.
queryById allows key-based lookups to species and
pathways, while speciesList allows input-free
access to the species that are found within Reactome.</p>
      <p>EuropePMC [Europe PMC Consortium, 2015] provides
access to abstracts and full text of the biomedical
literature from life science journals, online books, and
other resources including PubMed. We integrated
several web services from EuropePMC to pull metadata
about articles, citations, and references. search
performs SQL-like searches over the available
publications and it is parametrised with the type of returned
metadata: when users specify the option idlist the
web service returns the list of identifiers and sources
of the publications that match the input SQL query.
The option lite returns additional metadata about the
queried publications including the publication authors,
title, journal, volume and year. Finally, citations
and references allow identifier-based searches over
the publications that cite (respectively, are referenced by) a given
article. More details about EuropePMC web
services can be found under http://europepmc.org/restfulwebservice.</p>
      <p>In addition, we made use of an aggregate interface from
EBI, EB-eye [Squizzato et al., 2015]. This is a text
search engine that provides access to EBI’s data resources
through a RESTful API. We used EB-eye’s REST
interface to access data from ChEMBL and UniProtKB. Note
that the EB-eye web service returns a subset of the data
that is available through the web services provided by
the individual resources. See http://www.ebi.ac.uk/Tools/webservices/services/eb-eye_rest for
a description of the RESTful interface.</p>
      <p>FIBRes models web services with an SQL-like search
interface as access methods; each access is translated to an SQL
command consisting of a single conjunct. We had to restrict
the data that is pulled per web service call; as FIBRes only
integrates data in relational format, we do not parse the web
service attributes that may be populated by a list of values.
Most of the web services that are integrated are parametrised
with the preferred page size and output format (e.g., JSON
or XML). The page size was fixed to be the maximum value
allowed by the resource; for the output format, we preferred
JSON if available; otherwise XML.</p>
      <p>Global schema. We created a global schema, mappings,
and constraints manually. The global schema consists of a
set of views that represent the data of the remote datasources.
The objective is to conceal the heterogeneity of the remote
datasources and the different access interfaces, providing, at
the same time, a simple view of the data.</p>
      <p>To take the data in ChEMBL as an example, we
created one view for each web service in this source. That is,
we created views V<sub>Activity</sub>, V<sub>Assay</sub>, V<sub>Document</sub>, V<sub>Molecule</sub>,
V<sub>Target</sub> and V<sub>TargetComponent</sub>. The schema of each view is
the output schema of the corresponding web service,
modulo some normalization to change multi-valued attributes to
multiple scalar attributes. We adopted a similar rationale to
create views for the data of the remaining datasources. The
view V<sub>Protein</sub> captures the data returned by UniProtKB web
services and its schema is the output schema of the
corresponding services. We also defined the view V<sub>Pathway</sub>, whose
schema is the output schema of the frontPageItems and
queryById web services. In order to represent data in
EuropePMC, we defined the views V<sub>Publication</sub>, V<sub>Citation</sub>
and V<sub>Reference</sub>. V<sub>Publication</sub> captures the data returned by
search, while V<sub>Citation</sub> and V<sub>Reference</sub> capture the data
returned by citations and references, respectively.
Finally, we defined bidirectional mappings between the views
created out of the ChEMBL and UniProtKB web services and
EB-eye’s web service. Figure 2 shows a portion of the global
schema.</p>
      <p>Note that there is a many-to-one relationship between data
sources and virtual tables in the global schema. The
redundancy in data resulted in the existence of multiple plans for
all of our user queries.</p>
      <p>Results. We report results on the performance of FIBRes
on 21 sample queries that were manually created. They
include examples representing the tasks in Example 2<fn><label>1</label><p>ChEBI is not yet modeled in our system, so Example 1 is not in
our benchmark.</p></fn> as well
as other queries that span multiple resources.</p>
      <p>We report for each query the time spent by the plan
generated by FIBRes, comparing it with the time spent by a
“randomly generated plan”. To approximate a randomly
generated plan, we ran FIBRes with no cost preference, and
generated multiple correct plans: we then chose one of the plans
randomly. For each query we report the time to get the first
result tuple and the time to get the first 500 result tuples (in
milliseconds). Note that some queries return no tuples (Q8,
Q9, Q17). Q20 returns only about 300 tuples, but getting the
complete answer set is infeasible for the random plan.</p>
      <p>We see that FIBRes-generated plans outperform the
random ones. The large gaps in times illustrate the wide range
of plans that are available for these queries.</p>
      <p>[Figure 3. Panels: Time to first tuple; Time to 500 tuples. Series in each panel: FIBRes; Random.]</p>
      <p>The time to run the planner for these queries can be
nontrivial, ranging from about 1 sec to several minutes, due to the
large number of plans that need to be considered — details
can be found in the technical report. But note that this time
is often orders of magnitude smaller than the gap between
runtimes shown in Figure 3. Further, the expectation is that
planning need only be done once, while the gains in runtime
are exploited repeatedly over many runs of the plan.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Ongoing work</title>
      <p>Our current FIBRes prototype is only a small step towards
declarative integration of web-service hosted biological data.
We claim only that the current results show some promise in
devising a system that searches through multiple exact
reformulations to find an efficient plan.</p>
      <p>Some of the major ongoing activities in improving FIBRes
are as follows:</p>
      <p>Incorporating query-based access. FIBRes supports
typical key-based lookup services, where the arguments
are values for a fixed set of attributes. Some biological
web services permit a more powerful interface,
allowing arbitrary attribute equalities on a given relation. It is
easy to support such interfaces at the planning level, but
accounting for the trade-offs between sending numerous
restrictive queries versus a single broad query is
challenging. We are currently extending our cost-estimation
procedures to account for these trade-offs.</p>
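      <p>The trade-off described above can be made concrete with a toy cost model. The sketch below is illustrative only and is not the FIBRes cost estimator: the ServiceCost class, its two parameters (a fixed per-request latency and a per-tuple transfer cost), and the function names are all assumptions introduced here.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceCost:
    """Hypothetical per-service cost parameters (not FIBRes statistics)."""
    latency_ms: float    # assumed fixed overhead of one request
    per_tuple_ms: float  # assumed cost of transferring one returned tuple


def cost_many_restrictive(c: ServiceCost, num_keys: int, tuples_per_key: float) -> float:
    """One request per key; each returns only the matching tuples."""
    return num_keys * (c.latency_ms + tuples_per_key * c.per_tuple_ms)


def cost_single_broad(c: ServiceCost, total_tuples: float) -> float:
    """One request returning a superset that the client filters locally."""
    return c.latency_ms + total_tuples * c.per_tuple_ms


def cheaper_strategy(c: ServiceCost, num_keys: int, tuples_per_key: float, total_tuples: float) -> str:
    """Name the strategy with the lower estimated cost (ties go to 'broad')."""
    many = cost_many_restrictive(c, num_keys, tuples_per_key)
    broad = cost_single_broad(c, total_tuples)
    return "restrictive" if broad > many else "broad"
```
      </preformat>
      <p>In this simplified model, a handful of keys against a large relation favors the restrictive queries, while thousands of keys against a modest relation favor the single broad query, because the per-request overhead then dominates.</p>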
      <p>Incorporating keyword-based access. An important
trend in biological data management is to store the
underlying data in a document store, such as Lucene, and
provide a web service layer that wraps a keyword-based
interface. Typically a document store will return the top
K matches according to a particular scoring function,
where the scoring function may either be a standard IR
measure (e.g. TF/IDF) or a user-defined function.
Accounting for top-K semantics requires extensive changes
in the FIBRes architecture and optimizer.</p>
      <p>Multi-planning for reliability. Web services often fail,
and thus an important component of an integration
system is its facilities for identifying and reacting to failure. As
mentioned in Section 3, FIBRes has support for handling
transient failures, such as those that arise when a service is
temporarily unreachable or (more commonly) when a
service responds to a large number of requests by throttling
the client. FIBRes currently performs static planning,
returning a single plan that interacts with web services.
This approach is clearly brittle in the presence of
long-duration service failures. We are currently modifying
the planning algorithm to produce alternate plans that
allow resilience in the presence of failures of any
single service. The resulting plan would be coupled with a
performance monitor that manages failover. The
knowledge of semantic relationships between services plays a
key role here; it allows FIBRes to devise alternate plans
that are widely different syntactically, but which provide
the same answers.
Going beyond this, we are incorporating dynamic
replanning, allowing the planner to re-plan as new
performance statistics arrive.</p>
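      <p>The failover idea can be sketched as a small driver that retries a plan on a transient failure and then falls back to a semantically equivalent alternate plan. This is a hedged illustration in Python, not the FIBRes monitor: the ServiceUnavailable exception, the run_with_failover function, and the representation of a plan as a zero-argument callable are hypothetical.</p>
      <preformat>
```python
import time
from typing import Callable, List, Optional, Sequence

Rows = List[tuple]


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a service call inside a plan fails."""


def run_with_failover(plans: Sequence[Callable[[], Rows]],
                      retries: int = 2,
                      backoff_s: float = 0.0) -> Rows:
    """Try each equivalent plan in order, retrying transient failures.

    Each plan is assumed to return the same answers, so the first
    plan that completes determines the result.
    """
    last_error: Optional[Exception] = None
    for plan in plans:
        for attempt in range(retries + 1):
            try:
                return plan()
            except ServiceUnavailable as err:
                last_error = err
                # exponential backoff before the next attempt or plan
                time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("all alternate plans failed") from last_error
```
      </preformat>
      <p>The ordering of the plans stands in for the cost preference: the cheapest plan is tried first, and the more expensive alternates are used only when a service it depends on stays unavailable.</p>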
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Benedikt et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>M.</given-names>
            <surname>Benedikt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leblay</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Tsamoura</surname>
          </string-name>
          .
          <article-title>PDQ: Proof-driven query answering over web-based data</article-title>
          .
          <source>In VLDB</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Benedikt et al., 2015a]
          <string-name>
            <given-names>M.</given-names>
            <surname>Benedikt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leblay</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Tsamoura</surname>
          </string-name>
          .
          <article-title>Querying with access patterns and integrity constraints</article-title>
          .
          <source>In VLDB</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Benedikt et al., 2015b]
          <string-name>
            <given-names>M.</given-names>
            <surname>Benedikt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>ten Cate</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Tsamoura</surname>
          </string-name>
          .
          <article-title>Generating plans from proofs</article-title>
          .
          <source>In TODS</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Benedikt et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>M.</given-names>
            <surname>Benedikt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leblay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>ten Cate</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Tsamoura</surname>
          </string-name>
          .
          <article-title>Generating Plans from Proofs: The Interpolation-based Approach to Query Reformulation</article-title>
          . Morgan &amp; Claypool,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Bento et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bento</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gaulton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hersey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Bellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chambers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Davies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Krüger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Light</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McGlinchey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nowotka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Santos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Overington</surname>
          </string-name>
          .
          <article-title>The ChEMBL bioactivity database: an update</article-title>
          .
          <source>Nuc. acids research</source>
          ,
          <volume>42</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Chen et al.,
          <year>2010</year>
          ]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Wild</surname>
          </string-name>
          .
          <article-title>Chem2bio2rdf: a semantic framework for linking and data mining chemogenomic and systems chemical biology data</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Davidson et al.,
          <year>2001</year>
          ]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Crabtree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Brunk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schug</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Overton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Stoeckert</surname>
            <suffix>Jr</suffix>
          </string-name>
          .
          <article-title>K2/Kleisli and GUS: Experiments in integrated access to genomic data sources</article-title>
          .
          <source>IBM Systems Journal</source>
          ,
          <volume>40</volume>
          (
          <issue>2</issue>
          ):
          <fpage>512</fpage>
          -
          <lpage>531</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Europe PMC Consortium,
          <year>2015</year>
          ]
          <collab>Europe PMC Consortium</collab>
          .
          <article-title>Europe PMC: a full-text literature database for the life sciences and platform for innovation</article-title>
          .
          <source>Nuc. acids research</source>
          ,
          <volume>43</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Fagin et al.,
          <year>2005</year>
          ]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Kolaitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          .
          <article-title>Data exchange: Semantics and query answering</article-title>
          .
          <source>Theoretical Computer Science</source>
          ,
          <volume>336</volume>
          (
          <issue>1</issue>
          ):
          <fpage>89</fpage>
          -
          <lpage>124</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Goble and Stevens, 2008]
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Goble</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Stevens</surname>
          </string-name>
          .
          <article-title>State of the nation in data integration for bioinformatics</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          ,
          <volume>41</volume>
          (
          <issue>5</issue>
          ):
          <fpage>687</fpage>
          -
          <lpage>693</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Goble et al.,
          <year>2001</year>
          ]
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Goble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bechhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. W.</given-names>
            <surname>Paton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Peim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Brass</surname>
          </string-name>
          .
          <article-title>Transparent access to multiple bioinformatics information sources</article-title>
          .
          <source>IBM Systems Journal</source>
          ,
          <volume>40</volume>
          (
          <issue>2</issue>
          ):
          <fpage>532</fpage>
          -
          <lpage>551</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Goecks et al.,
          <year>2010</year>
          ]
          <string-name>
            <given-names>J.</given-names>
            <surname>Goecks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nekrutenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          , and
          <collab>The Galaxy Team</collab>
          .
          <article-title>Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences</article-title>
          .
          <source>Genome Biol</source>
          ,
          <volume>11</volume>
          (
          <issue>8</issue>
          ):
          <fpage>R86</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Gomez-Cabrero et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gomez-Cabrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Abugessaisa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Teschendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Merkenschlager</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ballestar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bongcam-Rudloff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Conesa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Tegnér</surname>
          </string-name>
          .
          <article-title>Data integration in the era of omics: current and future challenges</article-title>
          .
          <source>BMC Systems Biology</source>
          ,
          <volume>8</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Haas et al.,
          <year>2001</year>
          ]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kodali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kotlar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Rice</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Swope</surname>
          </string-name>
          .
          <article-title>Discoverylink: A system for integrated access to life sciences data sources</article-title>
          .
          <source>IBM Systems Journal</source>
          ,
          <volume>40</volume>
          (
          <issue>2</issue>
          ):
          <fpage>489</fpage>
          -
          <lpage>511</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Hernandez and Kambhampati, 2004]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Kambhampati</surname>
          </string-name>
          .
          <article-title>Integration of biological sources: Current systems and challenges ahead</article-title>
          .
          <source>SIGMOD Record</source>
          ,
          <volume>33</volume>
          (
          <issue>3</issue>
          ):
          <fpage>51</fpage>
          -
          <lpage>60</lpage>
          ,
          September
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Kambhampati et al.,
          <year>2004</year>
          ]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kambhampati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Lambrecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Nambiar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Nie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Senthil</surname>
          </string-name>
          .
          <article-title>Optimizing recursive information gathering plans in EMERAC</article-title>
          .
          <source>J. Int. Inf. Sys.</source>
          ,
          <volume>22</volume>
          (
          <issue>2</issue>
          ):
          <fpage>119</fpage>
          -
          <lpage>153</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Lapatas et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lapatas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stefanidakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Jimenez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Via</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Schneider</surname>
          </string-name>
          .
          <article-title>Data integration in biological research: an overview</article-title>
          .
          <source>Journal of Biological Research</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Lenzerini, 2002]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          .
          <article-title>Data integration: A theoretical perspective</article-title>
          .
          <source>In PODS</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Maier et al.,
          <year>1979</year>
          ]
          <string-name>
            <given-names>D.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Mendelzon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sagiv</surname>
          </string-name>
          .
          <article-title>Testing implications of data dependencies</article-title>
          .
          <source>TODS</source>
          ,
          <volume>4</volume>
          (
          <issue>4</issue>
          ):
          <fpage>455</fpage>
          -
          <lpage>469</lpage>
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Milacic et al.,
          <year>2012</year>
          ]
          <string-name>
            <given-names>M.</given-names>
            <surname>Milacic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Haw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rothfels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hermjakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>D'Eustachio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Annotating cancer variants and anti-cancer therapeutics in reactome</article-title>
          .
          <source>Cancers</source>
          ,
          <volume>4</volume>
          (
          <issue>4</issue>
          ),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Onet,
          <year>2013</year>
          ]
          <string-name>
            <given-names>A.</given-names>
            <surname>Onet</surname>
          </string-name>
          .
          <article-title>The chase procedure and its applications in data exchange</article-title>
          .
          <source>In Data Exchange, Information, and Streams</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Paton,
          <year>2008</year>
          ]
          <string-name>
            <given-names>N. W.</given-names>
            <surname>Paton</surname>
          </string-name>
          .
          <article-title>Data integration in the life sciences: Fun, findings and frustrations</article-title>
          .
          <source>In DILS</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Ramakrishnan and Gehrke, 2003]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          .
          <source>Database management systems (3. ed.)</source>
          .
          <publisher-name>McGraw-Hill</publisher-name>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Squizzato et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>S.</given-names>
            <surname>Squizzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. M.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Buso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cowley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Uludag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pundir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Cham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>McWilliam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Lopez</surname>
          </string-name>
          .
          <article-title>The EBI Search engine: providing search and retrieval functionality for biological data from EMBL-EBI</article-title>
          .
          <source>Nuc. acids research</source>
          ,
          <volume>43</volume>
          (
          <issue>W1</issue>
          ),
          July
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [Thakkar et al.,
          <year>2005</year>
          ]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thakkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Ambite</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Knoblock</surname>
          </string-name>
          .
          <article-title>Composing, optimizing, and executing plans for bioinformatics web services</article-title>
          .
          <source>VLDB J.</source>
          ,
          <volume>14</volume>
          (
          <issue>3</issue>
          ):
          <fpage>330</fpage>
          -
          <lpage>353</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [The UniProt Consortium,
          <year>2015</year>
          ]
          <collab>The UniProt Consortium</collab>
          .
          <article-title>UniProt: a hub for protein information</article-title>
          .
          <source>Nuc. acids research</source>
          ,
          <volume>43</volume>
          (
          <issue>D1</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [Thiam Yui et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>C.</given-names>
            <surname>Thiam Yui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Jik</given-names>
            <surname>Soon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Husain</surname>
          </string-name>
          .
          <article-title>A Survey on Data Integration in Bioinformatics</article-title>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [Wolstencroft et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>K.</given-names>
            <surname>Wolstencroft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Haines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fellows</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Withers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Owen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Soiland-Reyes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dunlop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nenadic</surname>
          </string-name>
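          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fisher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bhagat</surname>
          </string-name>
          ,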
          <string-name>
            <given-names>K.</given-names>
            <surname>Belhajjame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bacall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hardisty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nieva de la Hidalga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Balcazar Vargas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sufi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Goble</surname>
          </string-name>
          .
          <article-title>The Taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud</article-title>
          .
          <source>Nuc. Acids Research</source>
          ,
          <volume>41</volume>
          :
          <fpage>557</fpage>
          -
          <lpage>561</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [Zhang et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Haider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Baran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Guberman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kasprzyk</surname>
          </string-name>
          .
          <article-title>BioMart: a data federation framework for large collaborative projects</article-title>
          .
          <source>Database: The Journal of Biological Databases and Curation</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>