LifeDB: An Autonomous Semantic Data Integration System for Life Sciences? Anupam Bhattacharjee, Aminul Islam, Mohammad Shafkat Amin, Shahriyar Hossain, Shazzad Hosain, and Hasan Jamil Department of Computer Science, Wayne State University, USA {anupam,aminul,shafkat,shah h,shazzad,hmjamil}@wayne.edu Abstract Data intensive applications in Life Sciences extensively use the Hidden Web as a plat- form for information sharing. Access to Hidden Web resources is limited through the use of predefined web forms and interactive interfaces that users must navigate manu- ally. Hence, the effective use of these resources rely on users’ rational interpretation of the associated schema and the presented information. Since the computational model for an application is usually in the user’s mind, the user is responsible for reconciling schema heterogeneity, mediating missing information, extracting information and pip- ing, transforming format and so on in order to implement the desired query sequences or scientific workflows. While simple and relatively modest applications can be imple- mented and executed this way, large scale scientific investigations are often hard to capture, reuse, and maintain without the support of a set of sophisticated tools. The traditional solution to this problem was to download needed data from different sources to a local machine and then develop customized applications to implement the workflow in mind by manually reconciling the schema and format heterogeneity. Although this alternative is efficient, it lacks currency and invites view materialization related complications. A second alternative is to write glue codes, say in Perl, or Java, to connect the remote sites, download data, and run queries . Although the advantage here is increased currency and less maintenance, the disadvantage is the increased cost of needed programming. In LifeDB, we offer a third alternative that combines the advantages of the previous two approaches – currency and reconciliation of schema heterogeneity, in one single platform through a declarative query language called BioFlow. In our approach, schema heterogeneity is resolved at run time by treating the hidden web resources as a virtual warehouse, and by supporting a set of primitives for data integration on the fly to extract information and pipe to other resources, and to manipulate data in a way similar to traditional database systems in order to meet application demands. We use a state of the art schema matching system called OntoMatch, a wrapper generation system called FastWrap, and the latest internet computing tools to build LifeDB and to design a query processing engine for BioFlow. We offer several language constructs to support mixed-mode queries involving XML and relational data, application design using stored workflows, structured programming using process definition and reuse, and workflow design using ordered process graphs. We demonstrate the salient features of our system using a substantial set of online examples in real time. Finally, we show that a graphical tool can be used by a novice user to design applications in BioFlow without actually knowing anything about the language. Readers may refer to the lab home page at http://integra.cs.wayne.edu/ for further information on BioFlow and LifeDB. ? Research supported in part by National Science Foundation grants CNS 0521454 and IIS 0612203.