Using Semantic Web Technology to Automate Data Integration in Grid and Web Service Architectures

Martin Szomszor, Terry R. Payne and Luc Moreau
School of Electronics and Computer Science
University of Southampton
Southampton, SO17 1BJ, UK
{mns03r, trp, L.Moreau}@ecs.soton.ac.uk

Abstract

While the Grid and Web Services have helped us support heterogeneous resource access through the use of service oriented architectures, they have not addressed the issue of heterogeneous data representation. Since service providers often describe their service interfaces using different data models than those assumed by the client, it is common for additional processing to be required to compensate for the mismatch in data formats. By utilising technology from the Semantic Web, we are able to augment existing Web Service systems with middleware to automatically perform data harmonisation when a syntactic mismatch occurs. To achieve this, we have developed a mapping language which can be used to annotate XML data structures with OWL concepts and properties, a Mapping Language Engine to implement this language, and a Dynamic Web Service Invocation component to execute Web Services.

1 Introduction

Web Services are software components designed to support interoperable machine to machine interaction over a network. By defining standard languages to present software interfaces, such as WSDL [6], and protocols that describe interaction mechanisms, it is possible for computers to communicate across organisational boundaries from a range of heterogeneous platforms. This benefit has been noted by both the Grid computing and eBusiness communities, who have adopted Web Services as a fundamental building block for the development of large scale service-oriented architectures [8]. In these systems, it is often desirable to integrate disparate resources, for example, through the creation of a Virtual Organisation on a Grid, or through Enterprise Application Integration in eBusiness. During such a collaboration of resources, it is necessary for participants to exchange information in a format that is mutually intelligible. Given the wide range of heterogeneous data models used by service providers and service consumers, it cannot be assumed that data formats are compatible. Therefore, additional processing is required to integrate components using different syntactic structures, a process we refer to as syntactic mediation. While this process can be specified manually, either through the definition of data transformations or the creation of bespoke mediator components, it is desirable to automate it because it will save users effort and allow them to compose services without concern for data incompatibilities. To achieve this, we propose to utilise Semantic Web technology.

The Semantic Web [3] is an extension of the existing Web that aims to support the description of Web resources in formats that are machine understandable. On the Semantic Web, resources are given well defined meaning by annotating them with concepts and terminology that correlate with those used by humans. This can be achieved through the use of ontologies [9], providing a conceptual model that is common to all but independent of concrete representation. Therefore, to provide a framework that supports the automated mediation of syntactic structures, ontologies can be created that describe information models at a conceptual level, and used as a common vocabulary of terms for the exchange of data.

To focus our work, we examine a common service interaction from a bioinformatics Grid application. We identify where syntactic incompatibility occurs and why automated mediation is desirable. We then show the benefits of using ontologies to describe XML data structures and how this link can be specified using a mapping language. There follows a description of our mapping language and examples of how it can be used to annotate XML data structures. We then present our Mapping Language Engine, which implements the mapping language and performs translations between XML data and OWL concepts. Finally, we show how our Mapping Language Engine can be incorporated with our Dynamic Web Service Invocation component to create a system that performs syntactic mediation between Web Services using different data representations.

This paper is organised as follows: Section 2 introduces our bioinformatics use case, Section 3 presents the theory behind using semantic annotations, and Section 4 describes our mapping language, the Mapping Language Engine and how it is combined with the Dynamic Web Service Invoker. Section 5 reviews related work before we conclude and show further work in Section 6.

2 Motivation - Bioinformatics Use Case

Bioinformatics is the application of computational techniques to the management and analysis of biological information. With the collection and storage of large quantities of genomic and proteomic data, coupled with advanced computational analysis tools, a bioinformatician is able to perform experiments and test hypotheses without using conventional 'wet bench' equipment - a technique commonly referred to as in silico experimentation. To support this kind of science, a large collection of databases and tools has been developed to provide bioinformaticians with access to massive amounts of biological data and powerful computational software.

The myGrid project (http://www.mygrid.org.uk) provides an open-source Grid middleware that supports in silico biology. Using a service-oriented architecture based on Web Service standards such as WSDL and UDDI [1], a complex infrastructure has been created to provide bioinformaticians with a virtual workbench with which they can perform biological experiments. Access to data and computational resources is provided through Web Services, which can be composed using the workflow language XSCUFL (http://taverna.sourceforge.net/docs/xscuflspecification.html) and executed using the FREEFLUO enactment engine (http://sourceforge.net/projects/freefluo/). The biologist is provided with a user interface, Taverna (http://taverna.sourceforge.net/), which presents the services available, enables them to create and view workflows graphically, execute them, and view the results.

For our use case, we examine a common bioinformatics task: retrieve sequence data from a database and pass it to an alignment tool to check for similarities with other known sequences. According to the service-oriented view of resource access adhered to by myGrid, this interaction can be modelled as a simple workflow with each stage in the task being fulfilled by a Web Service.

Many Web Services are available to retrieve sequence data. For our example, we use one available at XEMBL (http://www.ebi.ac.uk/xembl/) and another at DDBJ-XML (http://xml.ddbj.nig.ac.jp/index.html). To obtain a record, an accession number is passed as input and an XML document is returned. These documents essentially contain the same data, namely the sequence data as a string (e.g. atgagtga...), references to publications, and features of the sequence (such as the protein translation). However, the format returned by each provider is different - XEMBL returns a BSML formatted document (http://www.ebi.ac.uk/xembl/dtd/BSML2 2.DTD) and DDBJ returns a document using its own custom format (http://getentry.ddbj.nig.ac.jp/xml/DDBJXML.dtd).

The next stage in the workflow is to pass the sequence data to an alignment service such as the BLAST service at NCBI (http://www.ncbi.nlm.nih.gov/BLAST/). This service can take a string of sequence data as input and return the result set in XML. We show this simple workflow in Figure 1.

[Figure 1 diagram: an Accession Number is passed to a Get Sequence Data service (XEMBL or DDBJ-XML); the resulting Sequence Data is passed to the NCBI-Blast Sequence Alignment service, which produces Blast Results.]

Figure 1. A simple bioinformatics task: get sequence data from a database and perform a sequence alignment on it.

Intuitively, a bioinformatician will view the two sequence retrieval tasks as the same type of operation, expecting both of them to be compatible with the NCBI Blast service. However, when plugging the two components together, additional information must be provided to specify how data is extracted from one data structure and passed into the next. This could be achieved using a data transformation language such as XSLT [7] or XQUERY [4], but it would require the manual specification of all possible transformations. For n compatible data formats, n(n − 1) transformations are required for maximum interoperability. Also, when a new data type is introduced, mappings to and from all other compatible types would have to be specified. Finally, users are not interested in the details of the service interaction; they prefer them to be hidden so they can focus on the scientific problem.

We propose an architecture in Section 4 that utilises Semantic Web technology to enable the automated mediation of syntactic structure between Web Services. By annotating XML structures with ontology concepts and properties, described in Section 3, we are able to automatically integrate syntactically incompatible services.

3 Semantic Annotations

In this section we show how an ontological definition of a data format can be used to integrate data structures passed between Web Services. We continue using the bioinformatics services presented in Section 2. This example is centred around the concept of some 'sequence data'. We have devised a simple ontology to express this information, which is shown in Figure 2. The main concept, Sequence_Data, has the datatype properties sequence (denoting the string of sequence data), description (a text annotation) and accession_id (unique id). Each sequence has a number of references, represented by the has-reference object property, and a number of features, represented by the has-feature object property. There are a number of sequence features; we show two common ones in this example: feature_source (where and how the sequence was gathered), and feature_CDS (which shows the protein sequence translation and id). Since BSML format and DDBJ format also contain additional information on the sequence, we introduce subconcepts called BSML_Sequence_Data and DDBJ_Sequence_Data.

[Figure 2 diagram: Sequence_Data (sequence, description, accession_id) has subconcepts BSML_Sequence_Data (database_cross_reference, date_last_updated, date_created) and DDBJ_Sequence_Data (molecular_Form, locus). Sequence_Data is linked by has-reference to Reference (authors, journal, title) and by has-feature to Sequence_Feature, which is linked by location to Sequence_Location (start, end). Sequence_Feature has subconcepts Feature_Source (lab_host, isolate, mol_type, organism) and Feature_CDS (translation, product, protein_id).]

Figure 2. An ontology to describe sequence data. See http://www.ecs.soton.ac.uk/~mns03r/ont/Sequence for full OWL description.

When examining the two services presented by XEMBL and DDBJ, we can consider their input and output to be similar; each takes a sequence data accession id as input and both return some sequence data. To be more specific, the XEMBL service returns the concept BSML_Sequence_Data and the DDBJ service returns the concept DDBJ_Sequence_Data. The next service in the workflow, NCBI Blast, takes some sequence data as input, namely an individual of type Sequence_Data with the sequence property type specified. Given that the BSML_Sequence_Data and DDBJ_Sequence_Data concepts are both subsumed by the Sequence_Data concept, i.e. the Sequence_Data concept is considered more general, we say that the output from both of the sequence data retrieval services is semantically compatible with the input to the BLAST service. However, the services are not syntactically compatible since the output dataset cannot be passed directly as input to the BLAST service. Therefore, a stage of syntactic mediation is required to extract data from one dataset and transform it to create a new dataset.

To automate the process of syntactic mediation, we require mappings from concrete XML structures to conceptual ontology structures. To enable this specification, we have developed a mapping language, presented in Section 4.1, which can be used to specify mappings between XML and OWL [12]. Partial mappings for the two sequence retrieval services are shown in Figure 3. These statements show how the sequence data and accession id can be retrieved from the XML data structure and used to create new OWL concepts. A full mapping for each can be found online (http://www.ecs.soton.ac.uk/~mns03r/mapping/bsml mapping.mp and http://www.ecs.soton.ac.uk/~mns03r/mapping/ddbj mapping.mp). Due to their complexity, they cannot be listed in full within this paper.

    {xml}
    bsml:Bsml(
      bsml:Definitions(
        bsml:Sequences(
          bsml:Sequence[ic-acckey = $accession](
            bsml:Seq-data($sequence)
    ))))
    <->
    {owl}
    ont:BSML_Sequence_Data(
      ont:accession_id($accession),
      ont:sequence($sequence),
    )
    USING
    ont for ,
    bsml for

    (a) BSML to Sequence Data mapping

    {xml}
    ddbj:ddbjxml(
      ddbj:accession($accession),
      ddbj:sequence($sequence)
    )
    <->
    {owl}
    seq:DDBJ_Sequence_Data(
      seq:accession_id($accession),
      seq:sequence($sequence),
    )
    USING
    seq for ,
    ddbj for

    (b) DDBJ to Sequence Data mapping

Figure 3. Partial mappings from XML to OWL for Sequence Data.

When using OWL concepts and properties to annotate XML data structures, we do not require mappings between all compatible formats. Instead, each data format requires only one mapping to the ontological specification. With this approach, the number of mappings required for n compatible data formats grows as O(n) instead of the quadratic growth discussed in Section 2. It is also more convenient when adding new formats to an existing system since only one mapping is required to achieve maximum interoperability.

4 Architecture

In this section we present the grammar and semantics of our mapping language before showing the design of our Mapping Language Engine and its integration with our Dynamic Web Service Invocation component.

4.1 Mapping Language

Our mapping language can be used to specify two types of mapping: ontology concept instances to XML and XML to ontology concept instances. The grammar for the language is given in Figure 4 using standard BNF notation.

    <mapping>  ::= {<type>} <exp> <mapsym> {<type>} <exp> <using> |
                   {<type>} <exp> <mapsym> {<type>} <exp>
    <type>     ::= xml | owl
    <exp>      ::= <elem> [ <attr*> ]( <exp*> ) |
                   <elem> ( <exp*> ) |
                   <elem> [ <attr*> ]( <exp*> )<elli> |
                   <elem> ( <exp*> )<elli> |
                   <concat> ( <atom*> ) |
                   <concat> ( <atom*> )<elli> |
                   <split> ( <atom*> ) |
                   <split> ( <atom*> )<elli> |
                   <constant> |
                   <var>
    <exp*>     ::= <exp> | <exp*> , <exp>
    <attr>     ::= <qname> = <var> | <qname> = "<constant>"
    <attr*>    ::= <attr> | <attr*> , <attr>
    <atom>     ::= <constant> | <variable>
    <atom*>    ::= <atom> | <atom*> , <atom>
    <elem>     ::= <qname>
    <qname>    ::= <chars> : <chars> | <chars>
    <constant> ::= "<chars>"
    <var>      ::= $<chars>
    <mapsym>   ::= <->
    <elli>     ::= ...
    <concat>   ::= concat
    <split>    ::= split
    <using>    ::= USING <binding*>
    <binding>  ::= <prefix> for < <url> >
    <binding*> ::= <binding> | <binding*> , <binding>
    <prefix>   ::= <chars>

Figure 4. The mapping language grammar in BNF notation.

A mapping is composed of a source type ({type}), a source expression, a mapping symbol (<->), a destination type, a destination expression and a set of using statements that map URLs to prefixes. An expression can be one of five kinds: <elem>, <constant>, <var>, <split> or <concat>. An element expression corresponds to a concept or property type name for an ontology concept instance, or to the element name within an XML document. The contents of an element, contained within parentheses, is a sequence of further expressions delimited by commas. These sub-expressions correspond to child nodes of the parent element (for XML) or property types of the parent concept (for OWL). If the sub-expression is a variable (prefixed by a $ symbol), this denotes that the text child of the parent is bound to the variable. Hence, the value of a variable within the source expression is mapped to the corresponding variable value in the destination expression. Constants may be specified for elements in the destination expression to define element or concept constructions that are created independently of the inputs. A list of attributes may also be specified for an XML mapping by enclosing them within square brackets after the element name. An attribute expression may either assign a variable to an attribute value or specify the condition that an element must have an attribute with a specific value to be valid. An example of the variable assignment attribute construct can be found in Figure 3(a), where the accession id is extracted from an attribute named ic-acckey in the bsml:Sequence element.

The <split> and <concat> expressions may be used in source and destination expressions respectively. The <split> function takes three arguments: a variable, a constant and another variable. When applied to the text of an XML element or the value of an OWL datatype property, the string is split into two according to the delimiter specified in the second argument, and the two parts are assigned to the two variables specified. If more than one match is found, the string is broken at the first instance of the delimiter. The <concat> expression can be used in destination expressions to indicate the concatenation of constants or variables.

We include the <elli> (ellipsis) construct to enable the processing of lists. This can be utilised when many instances of the same element within XML are to be mapped to many concept relations in the ontology (or vice versa). It is also possible to use the <elli> suffix with the <concat> operator to indicate multiple element values that map to a single element. In both cases, the ellipsis construct preserves the order of list elements. The inspiration for the ellipsis construct came from the Scheme [10] macro language, where it is used for list processing in a similar way.

4.2 Mapping Language Engine

Our Mapping Language Engine (MLE), pictured in Figure 5, is a JAVA component built on the Jena Framework (http://jena.sourceforge.net/) and Dom4J (http://www.dom4j.org/). To carry out a transformation, the MLE can be passed a mapping statement and a source data structure. A source data structure may be an XML document (using OWL serialisations for ontology concepts) or a reference to an individual within a Jena Ontology Model. The MLE parses the mapping expression and builds a list of variable bindings from the source expression. The destination expression is then evaluated and a new data structure is created. The result can be returned either in XML (again using OWL syntax for the serialisation of ontology concepts) or as a reference to a newly created individual within the Jena Ontology Model.

[Figure 5 diagram: a Mapping and Input Data enter the Mapping Language Engine, which contains a Parser, an Executor and a Data Model (Jena for owl, DOM4J for xml), and produces Output Data.]

Figure 5. The mapping language engine design.

Because we use Jena to store our OWL models, we can take advantage of the in-built reasoning it provides, the most useful of which is subsumption. Subsumption, usually denoted as C ⊑ D, is the reasoning process that checks whether the concept denoted by D (the subsumer) is more general than the concept denoted by C (the subsumee). With Jena, this task is performed automatically when new concepts and individuals are introduced into the current Jena model. In our example, we see that the results from the first stage of the workflow (i.e. the sequence retrieval services) can be BSML_Sequence_Data or DDBJ_Sequence_Data concepts. When the MLE creates a new individual to represent the service output from either of these services, it uses the appropriate concept (i.e. BSML or DDBJ sequence data). When the individual is inserted into the current model, Jena will automatically classify an individual of either concept as Sequence_Data too, since Sequence_Data subsumes both concepts. Therefore, when creating the XML data set for input into the NCBI Blast service, either concept type is valid and the sequence data can be extracted. With this approach, users have the freedom to extend existing ontology definitions with their own more specific concepts without breaking compatibility with other more general data models.

4.3 Service Invocation

To enable the execution of Web Services, we have created a Dynamic Web Service Invoker (DWSI). The DWSI takes an XML representation of the WSDL input and the service endpoint and invokes the service. The results of the service are returned in XML. In Figure 6, we show how the DWSI can be combined with the MLE to create a system that automatically mediates between different representations of the same data. This diagram shows one possible execution of our bioinformatics use case. In this instance, the XEMBL service is used to retrieve the sequence data, after which it is passed to the NCBI Blast service for analysis. The first step is shown in the bottom left of the diagram, where the accession id is passed to the XEMBL service. The result is a BSML formatted representation of the sequence data. This is then passed to the MLE, along with Mapping 1, where it is translated into a BSML_Sequence_Data concept and inserted into the Jena model. The uppermost box in Figure 6 shows a snapshot of the Jena model with the datatype properties holding example data. To enable the invocation of the NCBI Blast service, the MLE takes Mapping 2 and creates an XML representation of the sequence data that is compatible with the blast service. This XML is then passed to a new instance of the DWSI, which invokes the service. Finally, the results of the blast service are returned by the DWSI.

[Figure 6 diagram. Bottom row, left to right: the sequence data accession id (AB000059) is passed to the XEMBL service through a Dynamic Web Service Invoker (service invoked using SOAP over HTTP); the XML output from the XEMBL service is passed to the Mapping Language Engine, which translates it into a BSML_Sequence_Data concept using Mapping 1 (BSML{xml} <-> BSML_Sequence_Data{owl}) and inserts the new concepts into the Jena model; the Mapping Language Engine then extracts the 'sequence' data property type from the BSML_Sequence_Data concept using Mapping 2 (Sequence_Data{owl} <-> NCBI_Blast_In{xml}) and creates the XML input for the NCBI_Blast service; a second Dynamic Web Service Invoker invokes NCBI_Blast using SOAP over HTTP and a Blast Result is obtained. Top box: a snapshot of the Jena model after execution of the XEMBL service, containing BSML_Sequence_Data (accession_id AB000059; sequence atgagtgat...; description Feline panleukopenia; database_cross_ref GOA:Q84372; database_cross_ref UniProt/TrEMBL:Q84372; date_last_updated 21-JAN-1999; date_created 12-JAN-1997), linked by has-reference to Reference (authors Horiuchi M.; journal Unpublished Ref; title Evolutionary pattern..) and by has-feature to Feature_CDS (translation MSDGAVQPDGG..; product capsid protein 2; protein_id BAA19020.1), whose location is a Sequence_Location (start 1; end 1755).]

Figure 6. An example of how our Mapping Language Engine and Dynamic Web Service Invoker can be combined to automatically perform syntactic mediation.

We have tested the performance cost of our preliminary prototype against hard coded XSLT transformations. On average, an XSLT transformation takes 30ms where our MLE takes approximately 190ms - roughly six times more processing time to perform the same translation. We consider this an acceptable cost considering the high level of interoperability our system supports. This cost is also a small fraction of the network time required in a Web Service invocation, which is usually around 5000ms or more.

5 Related Work

OWL-S [2] is a set of ontology definitions designed to capture the behaviour of services. The top level service ontology presents the service profile, a description of what the service does (e.g. that a service is used to buy a book). The service is described by the service model, which tells us how the service works (e.g. a book buying service requires the customer to select the book, provide credit card details and shipping information, and produces a transaction receipt). Finally, the service supports the service grounding, which specifies the invocation method for the service. In the service grounding, XSLT is used to describe how OWL structures are converted to XML SOAP messages. This essentially performs the same task as our mapping language, but since it is based on transforming the XML serialisation of the OWL concepts, it is unable to utilise any reasoning techniques. For example, if we expressed the mapping from an instance of the Sequence_Data concept to the BLAST service input using XSLT, it would not be able to transform an instance of the BSML_Sequence_Data concept because the tag names used in its XML serialisation would be different.

The Web Services Modelling Ontology (WSMO) [13] adopts a different approach to OWL-S. It also aims to provide a framework to support automated discovery, composition, and execution of Web Services based on logical inference mechanisms, but with a specific focus on Enterprise Application Integration. Conceptually, WSMO is based on an event driven architecture, so services do not directly invoke each other; instead, goals are created by clients and submitted to the WSMO infrastructure, which automatically manages the discovery and execution of services. Like OWL-S, WSMO uses ontologies to define formal models of information that have explicit semantics. However, the WSMO framework imposes a standardised message format (WSML) which WSMO participants use to communicate with each other. Message adapters can then be placed in front of existing components (such as WSDL Web Services and databases) to deal with the translations to and from traditional syntactic data structures. An example of such an adapter, which performs translations between WSML and Universal Business Language (UBL), can be found in Section 5.3 of [11]. With this approach the syntactic interface to a business service is hidden because its interface is exposed only through the WSMO framework. As such, explicit mappings from conceptual models to syntactic structures are not required.
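
The limitation noted above for groundings that transform the XML serialisation of OWL concepts can be made concrete with a small sketch. It is not the OWL-S grounding mechanism or our MLE; it is a minimal Python illustration in which the hierarchy table, the two function names and the flat XML serialisation are all hypothetical stand-ins chosen for brevity. A transformation keyed on tag names finds nothing to match when given a BSML_Sequence_Data individual, whereas a check based on subsumption accepts it as Sequence_Data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written fragment of the subconcept hierarchy in Figure 2.
SUBCONCEPT_OF = {
    "BSML_Sequence_Data": "Sequence_Data",
    "DDBJ_Sequence_Data": "Sequence_Data",
}


def is_subsumed_by(concept, general):
    """Walk up the hierarchy; True if `general` is `concept` or one of its ancestors."""
    while concept is not None:
        if concept == general:
            return True
        concept = SUBCONCEPT_OF.get(concept)
    return False


def sequence_by_tag(serialised, expected_tag="Sequence_Data"):
    """Mimic a transformation that is keyed on the root tag name only."""
    root = ET.fromstring(serialised)
    if root.tag != expected_tag:
        return None  # no rule matches the tag, so nothing is produced
    return root.findtext("sequence")


def sequence_by_concept(serialised, expected_concept="Sequence_Data"):
    """Accept any individual whose concept is subsumed by the expected concept."""
    root = ET.fromstring(serialised)
    if not is_subsumed_by(root.tag, expected_concept):
        return None
    return root.findtext("sequence")


# Simplified stand-in for the serialisation of a BSML_Sequence_Data individual;
# a real OWL serialisation is more verbose, and the sequence value is shortened.
bsml_individual = (
    "<BSML_Sequence_Data>"
    "<accession_id>AB000059</accession_id>"
    "<sequence>atgagtgat</sequence>"
    "</BSML_Sequence_Data>"
)

print(sequence_by_tag(bsml_individual))      # None
print(sequence_by_concept(bsml_individual))  # atgagtgat
```

The second function succeeds only because the concept hierarchy is consulted before the data is extracted; in a real system this role would be played by an ontology reasoner rather than a lookup table.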

The SEEK project [5] also addresses the problem of heterogeneous data representation in service oriented architectures. Within their framework, each service has a number of ports which expose a given functionality. Each port advertises a structural type which represents the format of the data the service is capable of processing. If the output of one service port is used as input to another service port, the connection is defined as structurally valid when the two types are the same. Each service port can also be allocated a semantic type, which is defined by a reference to a concept within an OWL ontology. If two service ports are plugged together, they are semantically valid if the output from the first port is subsumed by the input to the second port. Structural types are linked to semantic types by a registration mapping using a custom mapping language based on XPATH. If the concatenation of two ports is semantically valid, but not structurally valid, an XQUERY transformation can be generated to integrate the two ports, making the link structurally feasible.

6 Conclusions and Further Work

In this paper, we have used a bioinformatics Grid application to show the problem of data integration in open, service oriented architectures. We have identified a typical scenario where different syntactic structures are used by service providers, and shown how this affects the workflow process. After presenting the motivation behind a framework to support the automated mediation of syntactic structures, we described our solution, which is based on the use of Semantic Web technology. By mapping XML data structures to OWL concepts and properties, we can describe service inputs and outputs according to their conceptual types. When services are then plugged together, as in our use case where sequence data is retrieved from a database and passed to an alignment service, we can automatically transform data structures between different formats.

In terms of our mapping language, it would be useful to incorporate regular expression support for string matching. Our current language only provides a simple split operator that can be used to break down atomic string values into separate components. With regular expression support, we could allow users to specify more complex string manipulation functions.

Our current architecture assumes that mappings are known. Therefore, it would be beneficial to create a mapping repository which exposes a query interface allowing users to register new mappings, discover new mappings and identify the semantic type of a given XML fragment. Once such a registry has been implemented, we can integrate it with our MLE and DWSI so the appropriate mappings are retrieved automatically.

Finally, our last task is to formalise the link between syntactic type systems and the description logic models that underpin the OWL reasoning methods. We believe that a sound understanding of the problem will enable us to support a generic solution that is expressive enough to cope with a wide range of complex data structures.

Acknowledgment

This research is funded in part by EPSRC myGrid project (reference GR/R67743/01).

References

[1] UDDI technical white paper, September 2000.
[2] OWL-S: Semantic markup for web service. Technical report, The OWL Services Coalition, 2003.
[3] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, pages 34–43, 2001.
[4] S. Boag, D. Chamberlin, M. F. Fernandez, D. Florescu, J. Robie, and J. Simeon. XQuery 1.0: An XML query language. Technical report, W3C, 2003.
[5] S. Bowers and B. Ludascher. An ontology-driven framework for data transformation in scientific workflows. In Intl. Workshop on Data Integration in the Life Sciences (DILS'04), 2004.
[6] E. Christensen, F. Curbera, G. Meredith, and S. Weerawarana. Web services description language (WSDL) 1.1, March 2001. W3C.
[7] J. Clark. XSL transformations (XSLT) version 1.0. Technical report, W3C, 1999.
[8] I. Foster, C. Kesselman, J. M. Nick, and S. Tuecke. The physiology of the grid, an open grid services architecture for distributed systems integration, June 2002.
[9] T. R. Gruber. A translation approach to portable ontology specification. Knowledge Acquisition, (5):199–220, 1993.
[10] R. Kelsey, W. Clinger, and J. Rees. Revised (5) report on the algorithmic language Scheme. Higher-Order and Symbolic Computation, pages 7–105, 1998.
[11] M. Moran. D13.5v0.1 WSMX implementation, July 2004. WSMO Working Draft.
[12] P. F. Patel-Schneider, P. Hayes, and I. Horrocks. OWL web ontology language semantics and abstract syntax. Technical report, W3C, 2004.
[13] D. Roman, H. Lausen, and U. Keller. D2v1.0. Web service modeling ontology (WSMO), September 2004. WSMO Working Draft.