<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LifeDB: An Autonomous Semantic Data Integration System for Life Sciences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anupam Bhattacharjee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aminul Islam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Shafkat Amin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shahriyar Hossain</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shazzad Hosain</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hasan Jamil</string-name>
          <email>hmjamilg@wayne.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Wayne State University</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Research supported in part by National Science Foundation grants CNS 0521454 and IIS 0612203.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Data-intensive applications in the Life Sciences make extensive use of the Hidden Web as a
platform for information sharing. Access to Hidden Web resources is limited to
predefined web forms and interactive interfaces that users must navigate
manually. Hence, the effective use of these resources relies on users' rational interpretation of
the associated schema and the presented information. Since the computational model
for an application usually exists only in the user's mind, the user is responsible for reconciling
schema heterogeneity, mediating missing information, extracting information and piping it
to other resources, transforming formats, and so on, in order to implement the desired query sequences
or scientific workflows. While simple and relatively modest applications can be
implemented and executed this way, large-scale scientific investigations are often hard to
capture, reuse, and maintain without the support of a set of sophisticated tools.</p>
      <p>The traditional solution to this problem has been to download the needed data from different
sources to a local machine and then develop customized applications that implement
the intended workflow by manually reconciling schema and format heterogeneity.
Although this alternative is efficient, it lacks currency and invites complications related to
view materialization. A second alternative is to write glue code, say in Perl or Java,
to connect to the remote sites, download data, and run queries. Although the advantages
here are increased currency and less maintenance, the disadvantage is the increased cost
of the required programming.</p>
      <p>In LifeDB, we offer a third alternative that combines the advantages of the previous
two approaches, namely currency and reconciliation of schema heterogeneity, in a single
platform through a declarative query language called BioFlow. In our approach, schema
heterogeneity is resolved at run time by treating the Hidden Web resources as a virtual
warehouse, and by supporting a set of primitives for data integration on the fly to
extract information and pipe it to other resources, and to manipulate data in a way
similar to traditional database systems in order to meet application demands. We use
a state-of-the-art schema matching system called OntoMatch, a wrapper generation
system called FastWrap, and the latest internet computing tools to build LifeDB and
to design a query processing engine for BioFlow. We offer several language constructs
to support mixed-mode queries involving XML and relational data, application design
using stored workflows, structured programming using process definition and reuse, and
workflow design using ordered process graphs. We demonstrate the salient features of
our system using a substantial set of online examples in real time. Finally, we show that
a novice user can use a graphical tool to design applications in BioFlow without
knowing anything about the language. Readers may refer to the lab home page
at http://integra.cs.wayne.edu/ for further information on BioFlow and LifeDB.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>