<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Ontocloud - a Clinical Information Ontology Based Data Integration System</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Diogo</forename><forename type="middle">F C</forename><surname>Patrão</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Helena</forename><surname>Brentani</surname></persName>
							<email>helena.brentani@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Psychiatry Dept</orgName>
								<orgName type="institution">Univ. of São Paulo</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marcelo</forename><surname>Finger</surname></persName>
							<email>mfinger@ime.usp.br</email>
							<affiliation key="aff1">
								<orgName type="department">Computer Science Dept</orgName>
								<orgName type="institution">Univ. of São Paulo</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Renata</forename><surname>Wassermann</surname></persName>
							<email>renata@ime.usp.br</email>
							<affiliation key="aff1">
								<orgName type="department">Computer Science Dept</orgName>
								<orgName type="institution">Univ. of São Paulo</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Camargo</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Cancer</forename><surname>Center</surname></persName>
						</author>
						<title level="a" type="main">Ontocloud - a Clinical Information Ontology Based Data Integration System</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">EDE75D5265DF5D2281AD82FA3AADA951</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T09:34+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Relevant biomedical research relies on finding enough subjects matching inclusion criteria. Researchers struggle to find eligible patients because information is scattered across many different databases, data representations are incompatible, and working directly with databases requires technical knowledge. We identified the required features of a clinical data search system and used them to design and evaluate Ontocloud, a prototype of a dynamic, ontology-based database integration system with inference capabilities, built on open source software and open standards. A comparison between Ontocloud and three other database integration systems showed that our prototype fulfilled its purpose and can be improved to be used in production.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The technology to quickly retrieve patient information from the Electronic Health Record is crucial to biomedical research.</p><p>Traditional term-based search techniques have failed to deliver accurate and precise results, due to the high complexity of this knowledge domain <ref type="bibr" target="#b4">[Chard et al. 2011]</ref>.</p><p>Database integration <ref type="bibr" target="#b13">[Lenzerini 2002</ref><ref type="bibr" target="#b10">][Halevy 2001</ref><ref type="bibr" target="#b9">][Haas et al. 2002]</ref> provides techniques to consolidate information from several source databases, through a set of mappings, into a single global database, which is then queried by the user. The most established database integration tools are based on relational databases, which are not tailored to deal with different conceptualizations of the source databases <ref type="bibr" target="#b16">[Sujansky 2002</ref>].</p><p>Data collection for cancer research in a large hospital such as A.C. Camargo Cancer Center is hindered by a series of factors, the most important being: (1) Data is stored in many different databases in diverse ways, constantly changing and evolving; (2) Data is represented in a computer-friendly format that is hard for physicians and scientists to understand; (3) Collecting data manually is a time-consuming task, and clinical research projects need speed and accuracy in the recruitment phase; (4) The same information may be present at different levels of detail; and (5) Certain information is not explicitly asserted, but may be inferred from indirect data.</p><p>In this work, we designed, implemented and evaluated a prototype of a database integration system called Ontocloud, based on open source software and standards. 
It addresses issues (1)-(5) by providing the following key features: dynamic access to data on source databases; ontologies as the medium for data integration; inference of concepts; harmonization of the detail level of similar information (the semantic mismatch issue); independence of source databases; and data annotation. We describe how we implemented Ontocloud to solve a use case of integrating medical document metadata, and compare its characteristics against three other database integration architectures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Database integration</head><p>A database integration problem is described as taking several sources of complementary data and providing a single view for those sources <ref type="bibr" target="#b10">[Halevy 2001</ref>]. From a theoretical perspective <ref type="bibr" target="#b13">[Lenzerini 2002]</ref>, we can represent a data integration system I as a triple ⟨G, S, M⟩, where G is the global view, S is the set of source databases and M is the set of mapping functions from S to G.</p><p>There are two methods for providing the global view G: dynamic or static. Dynamic methods translate a query on the global view G into queries on the relevant source databases S and translate back the answers using the mappings M. Static methods (or data warehouse methods) create a materialized global view by translating and copying all data from the sources S into a new database G.</p><p>Both methods have their benefits and drawbacks. Dynamic methods rely on query rewriting or query answering, which are hard computational problems and therefore imply slower performance. As they directly query the source databases, results are always up to date. Static methods are easier to set up and faster to query; however, all data on the sources must be translated and a new database constructed before any queries can be answered. This procedure may require a higher level of access to the source databases and may take a great deal of time and disk space. Also, results are almost always outdated, and the global database needs to be refreshed periodically <ref type="bibr" target="#b10">[Halevy 2001</ref>].</p><p>Regarding the mappings, database integration systems can be classified as global as view (GAV) or local as view (LAV). Mappings on GAV systems transform the source database into the global view, and queries are answered by several different algorithms <ref type="bibr" target="#b10">[Halevy 2001</ref>]. 
Mappings on LAV systems map the global view into the source, and in order to answer a query presented to the global view G the system should apply query answering (to infer results on G based on results on S) or query rewriting (which translates the mappings from LAV to GAV). GAV mappings are easier for a developer to create than LAV mappings; however, GAV requires that all source databases be joined in one statement, which makes adding and removing sources harder than in LAV. The query answering or rewriting step in a LAV system may, depending on the complexity of the mappings, demand a great deal of computation, if it is solvable at all; GAV systems rely on faster algorithms.</p></div>
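<div xmlns="http://www.tei-c.org/ns/1.0"><p>The difference between dynamic and static access to a GAV global view can be sketched with Python's standard sqlite3 module. This is only an illustration under stated assumptions: the table names, column names and values are hypothetical and are not taken from the systems described in this paper. The global relation is defined as a view over two sources (dynamic), and a copy of it is materialized (static).</p><p>
```python
import sqlite3

# Hypothetical source schemas; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pathology_report (report_id INTEGER, patient INTEGER);
    CREATE TABLE imaging_report (exam_id INTEGER, patient INTEGER);
    INSERT INTO pathology_report VALUES (1, 100);
    INSERT INTO imaging_report VALUES (7, 100);

    -- GAV mapping M: the global relation is defined as a query over the sources S.
    CREATE VIEW global_document AS
        SELECT 'pathology' AS source, report_id AS doc_id, patient FROM pathology_report
        UNION ALL
        SELECT 'imaging' AS source, exam_id AS doc_id, patient FROM imaging_report;

    -- Static method: materialize the global view G once.
    CREATE TABLE global_document_copy AS SELECT * FROM global_document;
""")


def count_rows(relation):
    """Count rows of a relation; the name comes from this script, not user input."""
    return conn.execute(f"SELECT COUNT(*) FROM {relation}").fetchone()[0]


# A new record arrives in one source after materialization.
conn.execute("INSERT INTO imaging_report VALUES (8, 101)")

dynamic_count = count_rows("global_document")      # re-evaluated over the sources
static_count = count_rows("global_document_copy")  # stale until refreshed
print(dynamic_count, static_count)
```
</p><p>The view sees the new record immediately, while the materialized copy does not until it is rebuilt. Note also that the single view statement names every source, which is the GAV maintenance cost discussed above.</p></div>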
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Ontologies, inference and database integration</head><p>An ontology represents knowledge in a formal framework, as concepts and relationships between pairs of concepts. Ontologies have been considered in heterogeneous database integration due to their ability to perform inferences and potential to deal correctly with the semantic mismatch problem <ref type="bibr">[Wache et al. 2001] [Cruz and</ref><ref type="bibr" target="#b5">Xiao 2005]</ref>.</p><p>Semantic mismatch is a problem that is intrinsic to data integration that usually leads to loss of specificity <ref type="bibr" target="#b16">[Sujansky 2002</ref><ref type="bibr" target="#b11">] [Hull 1997</ref>]. It occurs when two sources of information have fields with similar but incompatible meanings. Usually, when it is necessary to join the two sources, the lowest level of detail should be adopted. In some cases of concept overlap it can be impossible to join sources. Ontologies, in data integration, mitigate information loss for some types of semantic mismatch. Figure <ref type="figure" target="#fig_0">1</ref> presents an example of such mismatch for information on patient smoking. Ontologies can be represented in RDF/XML<ref type="foot" target="#foot_0">1</ref> format or in triplestores, which can be thought of as an equivalent of a database for ontologies. SPARQL<ref type="foot" target="#foot_1">2</ref> is the query language defined for querying data in an ontology. The SPARQL 1.1 specification allows for joining remote endpoints and thus integrating different datasets.</p><p>Inference is the process by means of which new information is derived from existing data from an ontology. Given abstract concepts, general rules can be added to a knowledge base to allow for new facts to be inferred <ref type="bibr" target="#b15">[Russell and Norvig 2003</ref>]. An inference rule is divided in two parts, the head and the body. 
If the statements in the body are true, then the head statement is also true. See Figure <ref type="figure" target="#fig_1">2</ref> for an example of an inference rule.</p><p>Query expansion <ref type="bibr" target="#b0">[Bhogal et al. 2007</ref>] achieves inference by applying the rules to the query statements instead of to the facts of the knowledge base. A query q G that specifies a concept present in the head of some inference rule may have this statement substituted by the body of the rule (Figure <ref type="figure" target="#fig_1">2</ref>). </p></div>
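<div xmlns="http://www.tei-c.org/ns/1.0"><p>Query expansion as described above can be sketched in a few lines of Python. The class names follow the rule in Figure 2; the representation of rules and queries as tuples, the ex:hasAge property and all function names are assumptions made for this illustration. The sketch performs a single expansion pass and does not chain rules.</p><p>
```python
# Rule from the smoking example: head and body are (subject, predicate, object) triples.
# Class names are taken from Figure 2; the data structures are an assumption.
RULES = {
    ("?x", "rdf:type", "IncreasedRiskOfBreastCancer"): [
        ("?x", "rdf:type", "Smoker"),
        ("?x", "rdf:type", "Female"),
    ],
}


def unify(pattern, triple):
    """Return bindings that turn the rule head into the query triple, or None."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if bindings.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return bindings


def expand(query):
    """Replace every triple that matches a rule head with the rule body."""
    expanded = []
    for triple in query:
        for head, body in RULES.items():
            bindings = unify(head, triple)
            if bindings is not None:
                expanded.extend(
                    tuple(bindings.get(term, term) for term in body_triple)
                    for body_triple in body
                )
                break
        else:
            # No rule head matched: keep the triple unchanged.
            expanded.append(triple)
    return expanded


query = [
    ("?patient", "rdf:type", "IncreasedRiskOfBreastCancer"),
    ("?patient", "ex:hasAge", "?age"),
]
expanded_query = expand(query)
print(expanded_query)
```
</p><p>The derived class disappears from the expanded query, so it can be answered by a knowledge base that only asserts Smoker and Female.</p></div>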
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Related work</head><p>Calvanese <ref type="bibr" target="#b3">[Calvanese et al. 2007</ref>] describes Mastro-I, a data integration management system designed to keep data complexity within reasonable bounds. It relies on the IBM product Infosphere Federation Server<ref type="foot" target="#foot_2">3</ref> to access source databases. In other work <ref type="bibr" target="#b2">[Calvanese et al. 2011]</ref>, the same group describes a database integration case using Mastro-I, in which five different data models were used, including XML-based and relational databases. The integration was made in two steps: first the different data models were combined using the InfoSphere Federation Server; then the Mastro-I system was used to map those entities into concepts, thus achieving data integration. In this architecture, there are two layers of heterogeneity solving: first, all relational data is mapped at the Federation Server, and then mapped into DL concepts, where integration is actually achieved.</p><p>DBOM <ref type="bibr" target="#b6">[Cure and Bensaid 2008]</ref> is a GAV data integration system that uses decidable fragments of the OWL language, OWL-DL and OWL-DL lite, to map results from queries over a relational database to an ontology. Several different relational sources can be used at once. It is able to deal with different degrees of confidence in each source, by configuring parameters on the mappings. It is implemented as a Protégé<ref type="foot" target="#foot_3">4</ref> plug-in; however, the authors do not state whether this plug-in is available, nor have we found it on the internet for download. 
As a use case, the authors present the integration of two drug databases.</p><p>The Query Integrator System (QIS) <ref type="bibr" target="#b12">[Iller and Adkarni 2004</ref>] is a layer-based architecture that uses ontologies to represent and annotate metadata about the source databases; each change detected in the schemas generates annotations that can be reviewed later. It focuses on a dynamic environment where the source database schemas are constantly changing. Queries are composed by means of a visual tool that presents the annotations about the source databases and translates these queries into SQL on the source databases.</p><p>Min et al. <ref type="bibr" target="#b14">[Min et al. 2009</ref>] integrated two sources of prostate cancer clinical data: one maintained by the Radiation Oncology department and the other by the Tumor Registry. The first contained data about radiotherapy treatment and the other demographic data. Both databases were integrated into one ontology by mapping concepts to the two databases in a single D2R-Server instance. The integration was horizontal, as each database contained complementary data about one patient, except for one field, the TNM status, which was present in both.</p><p>Of the tools analyzed, only Mastro-I and DBOM have features that address all of the clinical database integration issues we identified. However, the first relies on non-free software and requires that relational sources be integrated first on a relational layer (the Infosphere Federation Server), and then on the ontology layer (Mastro). DBOM seems to be an interesting take on the subject; however, it is not available anywhere for download. QIS has very interesting features but is based on obsolete standards and software.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Ontocloud design</head><p>Ontocloud was designed to provide dynamic access to a consolidated global view of several source databases, using ontologies to consolidate heterogeneous data. Given a set of source databases S 1..n , a set of source endpoints E 1..n should be provided. Each source publishes its objects of interest as concepts of the global view G by means of a SPARQL endpoint. In order to get answers to a query q G over the global database G, the query must go through two transformation steps: the query expansion step accounts for inference, substituting terms not directly defined on the source endpoints; then the query federator step provides the query with SERVICE clauses that indicate in which source endpoint each concept is to be found (Figure <ref type="figure" target="#fig_2">3</ref>).</p><p>Ontocloud uses four ontologies. The global ontology lists the classes and properties in which the global database will be represented, as well as annotations. The federation ontology specifies the source databases and which classes and properties of the global ontology they implement. The mapping ontology relates tables and columns from a source database to basic concepts on the global ontology. The inference ontology maps derived concepts to basic concepts through an ontology alignment file.</p><p>The global ontology should be the starting point when designing an ontology-based database integration system, as the queries to be issued will refer to this ontology. It should be well annotated and descriptive, and should comprise the high-level concepts that will be queried as well as the ones actually present on the source databases. The latter are called base concepts, because they are directly related to a database object. The others are called derived concepts and should be related to base concepts by rules on the inference ontology. 
Each relational source database is required to have its own mapping file, which will translate access to the partial RDF graph of the global database G to SQL queries on the actual database objects.</p><p>The federation ontology lists all source databases and which concepts from the global ontology they provide. It is used by the federator step to translate a query q G to a query q E over the source endpoints.</p><p>These characteristics make Ontocloud an adequate solution for integrating clinical databases, addressing the issues stated in the introduction: (1) Integrated sources are independent, so adding, modifying or removing sources does not interfere with other sources; (2) The usage of ontologies allows for annotation of concepts, making them easier for a non-technically trained user to understand; (3) Data is accessed directly from the sources, always yielding up-to-date results; (4) Ontologies provide tools for dealing with semantic mismatch; and (5) Higher-level concepts are inferred from raw data, making all assumptions about data explicit and easy to audit.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Implementation</head><p>The Ontocloud implementation was based on open standards and open source software, and is described in this section (illustrated in Figure <ref type="figure" target="#fig_2">3</ref>). To expose source databases as SPARQL endpoints, we used D2R-Server <ref type="bibr" target="#b1">[Bizer and Seaborne 2004]</ref> with custom N3 mapping files. The query execution engine was ARQ, and custom software was implemented to perform the query expansion step (to accomplish inference) and the query federator step (to indicate which databases are to be queried). D2RQ <ref type="bibr" target="#b1">[Bizer and Seaborne 2004</ref>] is an OBDA<ref type="foot" target="#foot_4">5</ref> open source software. It is a Jena library that translates access to an RDF ontology specification into SQL queries, according to a mapping file. It includes D2R-Server, a server that provides a SPARQL endpoint over the mapped database, and dump-rdf<ref type="foot" target="#foot_5">6</ref>, which converts the entire mapped database to an RDF file. Jena is a "Java framework for semantic web applications"<ref type="foot" target="#foot_6">7</ref> , providing an API for handling RDF, OWL, inference, triple storage and a query engine. JDBC<ref type="foot" target="#foot_7">8</ref> is a Java library that provides a unified API to access several different databases. D2R-Server did not provide any function for date and time operations, so we wrote custom Java classes and used them in SPARQL queries.</p><p>The Query Expansion Step used the inference ontology to translate queries using derived concepts into base concepts. We used the Mediation<ref type="foot" target="#foot_8">9</ref> library, which translates queries on an ontology A to an ontology B by means of an EDOAL <ref type="bibr" target="#b7">[David et al. 2011</ref>] ontology alignment file. 
However, instead of mapping between two different ontologies, we mapped between concepts of the same ontology, avoiding circular references.</p><p>The Query Federator Step used the federation ontology to translate a query over the global ontology to the source endpoints. For each triple specified in the SPARQL query, it checks in which sources the concepts involved are present, and surrounds the triple with a SERVICE clause. If a concept is present in more than one source, it replaces the triple with a UNION of all SERVICE clauses. The software was written in Java using the Jena library.</p></div>
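<div xmlns="http://www.tei-c.org/ns/1.0"><p>The behavior of the Query Federator Step described above can be approximated by the following Python sketch. It is not the Java/Jena implementation used in Ontocloud: the federation ontology is reduced to a hypothetical dictionary, the concept and endpoint names (g:Document, g:hasAuthor, ep:ehr, ep:pathology) are invented for illustration, and endpoints are written as prefixed names whose PREFIX declarations are omitted.</p><p>
```python
# Hypothetical federation ontology, reduced to a dictionary:
# global concept -> source endpoints that provide it (written as prefixed names).
FEDERATION = {
    "g:Document": ["ep:ehr", "ep:pathology"],
    "g:hasAuthor": ["ep:ehr"],
}


def concept_of(triple):
    """Pick the concept that decides routing: the class for rdf:type, else the property."""
    _subject, predicate, obj = triple
    return obj if predicate == "rdf:type" else predicate


def federate(triple):
    """Wrap one triple pattern in SERVICE clauses, using UNION for several sources."""
    pattern = " ".join(triple) + " ."
    endpoints = FEDERATION.get(concept_of(triple), [])
    if not endpoints:
        raise KeyError(f"no source provides {concept_of(triple)}")
    clauses = [f"{{ SERVICE {ep} {{ {pattern} }} }}" for ep in endpoints]
    return " UNION ".join(clauses)


def federate_query(triples):
    """Federate each triple pattern of a query independently, one per line."""
    return "\n".join(federate(t) for t in triples)


federated = federate_query([
    ("?doc", "rdf:type", "g:Document"),
    ("?doc", "g:hasAuthor", "?physician"),
])
print(federated)
```
</p><p>A concept offered by two endpoints yields a UNION of two SERVICE groups, and a concept offered by one endpoint yields a single SERVICE group. Every triple gets its own clause, which mirrors the one-triple-per-SERVICE output of the unoptimized step.</p></div>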
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Use case</head><p>We selected as use case the problem of integrating clinical document metadata from four information systems used at A.C. Camargo Cancer Center: EHR, which contains most data from clinic services; Pathology, which contains reports from anatomic pathology tests (visual inspection of sample tissues); Image, which contains reports from imaging tests; and Prescriptions, which contains both inpatient evolution (texts describing the patient's day-to-day evolution) and prescriptions of drugs and procedures.</p><p>We retrospectively consulted the Medical Informatics Laboratory ticket system, in which all query requests made by doctors, managers and researchers are registered. Based on it, we compiled 17 queries of varying complexity to benchmark our integration system<ref type="foot" target="#foot_9">10</ref> . The Ethics Committee of A. C. Camargo Cancer Center, where this research was conducted, granted a waiver of informed consent. To answer those queries, we designed the global schema layout as depicted in Figure <ref type="figure" target="#fig_3">4</ref> and created the mappings accordingly.</p><p>We looked into the source databases for tables and columns that contained the information required by the defined global schema. Most databases contained all fields needed for the desired integration, except for the type of document on the Pathology, Image and Prescription databases and the Brazilian person registry number (CPF) for physicians on the Prescription database. 
We consulted physicians and discovered that documents on the Pathology and Image databases are always reports, and that documents on the Prescription database can be an evolution or a prescription, depending on whether a field is blank or not; the CPF number could be found for physicians who were linked to another database table, but not for all of them (in this case, we simply created a new record without the CPF number).</p><p>To account for missing data, we created simple rules of inference based on knowledge provided by physicians. For Pathology and Image, all documents were stated to have the "PATHOLOGY REPORT" and "IMAGING EXAM REPORT" types, respectively. For Prescription, a conditional rule (based on whether a text field has data or not) was used to determine if a document belonged to the "PRESCRIPTION" or "INPATIENT EVOLUTION" type. These rules were embedded in the mapping files.</p><p>It was possible to infer patient class based on the presence of certain types of documents on the patient's EHR; we implemented inference rules on the query expansion step using the EDOAL ontology alignment file format. Examples of those rules can be found in Supplementary Table <ref type="table" target="#tab_2">2</ref><ref type="foot" target="#foot_10">11</ref>. We replicated the original databases by retrieving pertinent tables and columns and storing them in a single MySQL server. To extract the original sources into the MySQL database we used Pentaho Data Integration Community Edition <ref type="bibr" target="#b8">[Golfarelli 2009</ref>]. It is a software suite to design and perform ETL (Extract, Transform, Load; a static database integration method). It allows one to graphically design scripts to extract data from several types of database, then transform, mix and store them in a different database table or file.</p></div>
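<div xmlns="http://www.tei-c.org/ns/1.0"><p>The document type rules described above amount to two constants and one conditional. The Python sketch below restates them outside the mapping files; the function and parameter names are hypothetical, and the direction of the Prescription condition (non-blank field means prescription) is an assumption, since the text only says that the type depends on whether a field is blank.</p><p>
```python
# Constant types stated by the physicians for two of the sources.
CONSTANT_TYPES = {
    "Pathology": "PATHOLOGY REPORT",
    "Image": "IMAGING EXAM REPORT",
}


def document_type(source, text_field=None):
    """Infer the missing document type for one record of a source database."""
    if source in CONSTANT_TYPES:
        return CONSTANT_TYPES[source]
    if source == "Prescription":
        # ASSUMPTION: a non-blank field marks a prescription; the paper does
        # not say which way the condition goes.
        has_data = bool(text_field and text_field.strip())
        return "PRESCRIPTION" if has_data else "INPATIENT EVOLUTION"
    raise ValueError(f"no type rule for source {source!r}")


print(document_type("Pathology"))
print(document_type("Prescription", "dipyrone 500 mg"))
print(document_type("Prescription", "   "))
```
</p><p>In Ontocloud these rules live in the D2R mapping files, so the global view exposes a document type even though no source column stores it.</p></div>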
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experimental setup</head><p>In order to assess the performance and accuracy of Ontocloud, we set up three other database integration systems; together with Ontocloud, the four systems cover all combinations of the two main database integration architecture characteristics: dynamic or static data access, and relational database or ontology data representation. We evaluated accuracy in a qualitative way, by making sure that all 17 queries yielded equivalent results on all database integration systems evaluated. Query performance was evaluated as the total clock time a query took to complete on an integration system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Source to global mapping</head><p>After replicating the source databases, we proceeded to set up all four database integration systems. The Supplementary Figure <ref type="figure" target="#fig_0">1</ref> depicts the experimental setup, and Supplementary Table <ref type="table">3</ref> the database size and extraction times.</p><p>The tools used to set up the other integration systems are as follows:</p><p>• Triplestore: Openlink Virtuoso Universal Server Open Source Edition provides, among many other things, an RDF triple store and a SPARQL endpoint. We chose Virtuoso to implement Triplestore, the static access ontology based integration method. For each of the four source databases, we used D2R to dump data into an N3 file. Those files were imported using Virtuoso Bulk Loader<ref type="foot" target="#foot_11">12</ref> . • Federation: Teiid<ref type="foot" target="#foot_12">13</ref> is an open-source, dynamic relational database integrator system; it allows the creation of views over database resources published on a JBoss<ref type="foot" target="#foot_13">14</ref> server, and it is accessible as a JDBC resource. Federation, the dynamic database integration architecture, was designed as a Teiid instance. For each table in the global schema, we wrote a consolidated view, composed of queries over each source database joined by UNION clauses. Those queries did all necessary mapping to provide the required information, even if it was spread in different tables on the source database. The missing document type of Pathology, Image and Prescription databases was inferred directly on the view statement as a SQL constant or expression. • Replication: The Replication architecture was created by materializing the Federation queries (translated to MySQL dialect) into tables. As in the source databases, every column in each database was indexed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Experiments</head><p>The 17 queries were transcribed to each integration system's language (SPARQL for ontology-based systems and SQL for the others) and dialect (function names and namespaces were slightly different between MySQL and Teiid, and between Virtuoso and ARQ).</p><p>There are no SERVICE-specific optimizations in Jena, and we have not implemented any for Ontocloud. In contrast, Teiid, the software we chose for implementing Federation, is highly optimized for this type of query. To account for this difference, we implemented two sets of queries for Ontocloud: one using both the query expansion and query federator steps, and another querying the sources directly with queries tuned by hand. This way, we get the actual running time for the current software and an estimate of what the timing would be if there were an optimization step. We ran one single round of all 17 queries in all systems, without a time limit, and saved the results. To avoid competition for server resources, only one query on a single integration system was executed at a given time.</p><p>The server on which the experimental setup was created and the tests were performed had four 64-bit cores at 3.00 GHz and 8 GB of RAM, running CentOS 5. The database software installed was MySQL server version 5.0.95. We also used Pentaho Data Integration Community Edition version 4.0.1, ARQ-2.8.8, PHP 5.2.5, Virtuoso Open Source Edition 6.1.4.3127, D2R-Server 0.8, Java 1.6.0.23, Teiid 7.7 and JBoss 5.1.0 GA.</p></div>
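<div xmlns="http://www.tei-c.org/ns/1.0"><p>The measurement procedure (one query at a time, wall-clock time per query, failures recorded rather than aborting the round) can be sketched as follows. This is a generic harness written for illustration, not the scripts used in the experiments; the run_query callable and the stand-in client are hypothetical.</p><p>
```python
import time


def time_queries(run_query, queries):
    """Run queries strictly one after another and record wall-clock seconds."""
    timings = {}
    for name, text in queries.items():
        start = time.perf_counter()
        try:
            run_query(text)
            status = "ok"
        except Exception as error:  # keep going so that one failure does not end the round
            status = f"failed: {type(error).__name__}"
        timings[name] = (status, time.perf_counter() - start)
    return timings


def fake_run_query(text):
    """Stand-in for a real client; a real run would send the text to an endpoint."""
    if "boom" in text:
        raise MemoryError("simulated lack of memory")
    return text.split()


results = time_queries(fake_run_query, {"q1": "SELECT 1", "q2": "SELECT boom"})
for name, (status, seconds) in results.items():
    print(name, status, f"{seconds:.6f}s")
```
</p><p>Running the queries sequentially in a single process keeps one query from competing with another for server resources, matching the constraint stated above.</p></div>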
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Results</head><p>A functional comparison between all systems can be seen in Table <ref type="table" target="#tab_2">1</ref>. All four integration systems were successfully configured and deployed. Except for queries 14 and 17 on Ontocloud Optimized, and queries 10-17 on Ontocloud Unoptimized, which were not completed due to lack of memory, all other queries on all evaluated systems completed successfully and yielded the same results. Ontocloud Optimized performed better than Federation on 7 queries out of 17, and was 15% faster than Ontocloud Raw (without optimizations). Replication was the fastest method of all, followed by Triplestore, which performed better than Federation and Ontocloud on 13 queries. Time measurements for all database integration systems can be seen in Figure <ref type="figure" target="#fig_4">5</ref> and Supplementary  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Discussion</head><p>The implementation of Ontocloud and the use case experiments showed that it is an adequate database integration system for clinical data, as it accomplishes the five objectives:</p><p>(1) The configuration of source databases was completely independent, except for the Federation Ontology, which lists the URL of each endpoint and the concepts each implements;</p><p>(2) The global ontology contained human-readable descriptions, so data would be easily understandable by non-technical personnel;</p><p>(3) Data is accessed directly from the sources, always yielding up-to-date results; (4) Mappings provided missing data in a way that is transparent to the end user; and (5) Higher level concepts like TratedPatient and InPatient are easily understood by physicians and managers, while being translated by the query expansion step to their definitions on raw data, allowing the query to be performed.</p><p>As we set up the integration systems, fundamental differences between Federation and Ontocloud arose. Federation requires that the developer explicitly join all sources in a single database view. That makes adding a new source a difficult and risky task, as it requires working on a SQL statement that involves several different source databases, and any mistake may compromise the whole integration system. Each Ontocloud source is configured without the need to take other sources into consideration. Instead, it relies on the Query Federator step, which adds to the original query clauses indicating in which source endpoint each triple will be resolved. Therefore, by keeping the mapping files separated, Ontocloud facilitates the maintenance of source databases.</p><p>Inference on Ontocloud was based on the Mediation library, which allowed us to implement rules by expanding each query term. 
The inference rules are detached from the database integration itself, and can be maintained independently of the sources. Also, as those rules are represented in an ontology language, they are more suitable for domain experts to maintain than in the relational methods, in which rules must be implemented in SQL. This also improves the information management of such a system, as it keeps the raw data (on the mapping ontologies) apart from the higher level concepts (on the inference ontology).</p><p>Ontocloud performance suffered on queries with aggregation or date operations. This occurs because SPARQL aggregation keywords and date manipulation functions are not translated directly to SQL; instead, all results are retrieved and the transformations are performed in memory. That both hindered performance and required a lot of memory. Also, the queries generated by the Query Federator step contained many SERVICE keywords, each containing only one triple. An important optimization would be to join triples in the same SERVICE pattern, minimizing the access to source endpoints. In addition, the order of triples and filters in the SPARQL query is crucial to performance. Those optimizations are beyond the scope of this work, but we expect that they would bring Ontocloud closer to the other methods. For the purpose stated in this work, the speed of Ontocloud seems a fair tradeoff for the ability to yield up-to-date results at any time and to perform inference.</p></div>
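<div xmlns="http://www.tei-c.org/ns/1.0"><p>The optimization suggested above, joining triples that go to the same endpoint into one SERVICE pattern, can be sketched in Python. This was not implemented in Ontocloud; the function name, endpoint names and properties are hypothetical, and the sketch assumes each triple has already been routed to exactly one endpoint (the UNION case for concepts offered by several sources is not handled).</p><p>
```python
from collections import defaultdict


def group_by_endpoint(routed_triples):
    """Merge triples bound for the same endpoint into a single SERVICE block."""
    groups = defaultdict(list)
    for endpoint, triple in routed_triples:
        groups[endpoint].append(" ".join(triple) + " .")
    # Dictionaries keep insertion order, so endpoints appear in first-use order.
    return "\n".join(
        f"SERVICE {endpoint} {{ {' '.join(patterns)} }}"
        for endpoint, patterns in groups.items()
    )


routed = [
    ("ep:ehr", ("?doc", "rdf:type", "g:Document")),
    ("ep:ehr", ("?doc", "g:hasAuthor", "?physician")),
    ("ep:image", ("?doc", "g:hasExam", "?exam")),
]
grouped = group_by_endpoint(routed)
print(grouped)
```
</p><p>Three single-triple SERVICE clauses become two blocks, so the EHR endpoint is contacted once and can evaluate the join between its two triples locally instead of returning both result sets to the federating engine.</p></div>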
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Conclusion</head><p>We have successfully designed and implemented Ontocloud to perform ontology-based database integration. It implements important features of a clinical data integration system: the sources are loosely coupled, favoring distributed and dynamic management of sources; it uses ontologies to integrate data, which favors reuse and is more human-readable; it has dynamic access to sources, always yielding up-to-date results; and it allows inference. We believe that this system architecture can be extended and improved, as indicated in the discussion, to become a production-level tool that is very useful in the medical informatics context.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 .</head><label>1</label><figDesc>Figure 1. Semantic mismatch. (A) An example of a similar field in two different databases, db1 and db2: a value of 0 in the db2 field is equivalent to N on db1; &lt;1 and 1 are equivalent to Yes - light usage; and 2-3 and &gt;3 are equivalent to Yes - heavy usage. A reverse mapping would not be possible without loss of specificity, because each of the db1 options for light and heavy smokers may correspond to more than one option on db2. (B) An ontology that classifies all concepts involved in the source databases db1 and db2 from (A). Specific concepts such as Smoker2to3PacksADay are classified under more general concepts, in this case HeavySmoker and Smoker. Instances of a specific class are also considered to belong to its parent classes. (C) db3 and db4 contain an example of a semantic mismatch that is impossible to solve: note how the concept Yes - light (10-20) on db3 can be mapped to both light (1-14 CPD) and heavy (15+ CPD) on db4, while those two concepts on db4 each map to two concepts on db3 (CPD: cigarettes per day).</figDesc><graphic coords="3,153.64,217.46,288.01,229.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 .</head><label>2</label><figDesc>Figure 2. Inference by means of query expansion. (A) The inference rule states that if there is a patient ?x that belongs to the classes Smoker and Female (rule body), then this patient also belongs to the class IncreasedRiskOfBreastCancer (head). (B) The term specified in the query is not stated as a fact in the knowledge base; however, an inference rule allows the term to be substituted, so the query can be answered.</figDesc><graphic coords="4,153.64,234.28,287.99,114.09" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 .</head><label>3</label><figDesc>Figure 3. Ontocloud system architecture.</figDesc><graphic coords="6,153.64,99.21,288.00,180.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 .</head><label>4</label><figDesc>Figure 4. Global schema of our use case, for ontology and relational database integration architectures.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 .</head><label>5</label><figDesc>Figure 5. Time each method took to run the 17 queries. The vertical axis is in log10 seconds, and the horizontal axis shows the query number. Query failures were plotted on the "Timeout" line, above all other measures.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 4 .</head><label>4</label><figDesc></figDesc><table><row><cell>Integration system</cell><cell>Data access strategy</cell><cell>Data heterogeneity solving method</cell><cell>Missing data</cell><cell>Annotation</cell><cell>Query expansion</cell></row><row><cell>Ontocloud</cell><cell>Dynamic</cell><cell>By ontology</cell><cell>Mapping</cell><cell>Yes</cell><cell>Yes</cell></row><row><cell>Federation</cell><cell>Dynamic</cell><cell>Least detailed</cell><cell>Mapping</cell><cell>No</cell><cell>No</cell></row><row><cell>Triplestore</cell><cell>Static</cell><cell>By ontology</cell><cell>Materialized</cell><cell>Yes</cell><cell>No</cell></row><row><cell>Replication</cell><cell>Static</cell><cell>Least detailed</cell><cell>Materialized</cell><cell>No</cell><cell>No</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1 . Data integration architecture features.</head><label>1</label><figDesc></figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.w3.org/TR/PR-rdf-syntax/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.w3.org/TR/rdf-sparql-query/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://www-01.ibm.com/software/data/infosphere/federation-server/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">http://protege.stanford.edu</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">Ontology Based Data Access</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">http://d2rq.org/dump-rdf</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">http://jena.apache.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">http://www.oracle.com/technetwork/java/overview-141217.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8">https://github.com/correndo/mediation</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_9">The queries are available at http://diogopatrao.com/ob/ as Supplementary Table 1.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_10">http://diogopatrao.com/ob/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_11">http://www.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoader</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_12">http://www.jboss.org/teiid/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_13">http://www.jboss.org/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A review of ontology based query expansion</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bhogal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Macfarlane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing &amp; Management</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="866" to="886" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">D2RQ-treating non-RDF databases as virtual RDF graphs</title>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Seaborne</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd International Semantic Web Conference</title>
				<meeting>the 3rd International Semantic Web Conference</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The MASTRO system for ontology-based data access</title>
		<author>
			<persName><forename type="first">D</forename><surname>Calvanese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>De Giacomo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lembo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lenzerini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Poggi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rodriguez-Muro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rosati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ruzzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">F</forename><surname>Savo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Semantic Web</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="43" to="53" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Mastro-i: Efficient integration of relational data through DL ontologies</title>
		<author>
			<persName><forename type="first">D</forename><surname>Calvanese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>De Giacomo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lembo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lenzerini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Poggi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rosati</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/" />
	</analytic>
	<monogr>
		<title level="m">Proc. of the 20th Int. Workshop on Description Logics</title>
		<title level="s">CEUR Electronic Workshop Proceedings</title>
		<meeting>of the 20th Int. Workshop on Description Logics</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="volume">250</biblScope>
			<biblScope unit="page" from="227" to="234" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A cloud-based approach to medical NLP</title>
		<author>
			<persName><forename type="first">K</forename><surname>Chard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Russell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">A</forename><surname>Lussier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A</forename><surname>Mendonça</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Silverstein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AMIA Annual Symposium Proceedings</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="207" to="216" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">The role of ontologies in data integration</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">F</forename><surname>Cruz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xiao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of engineering intelligent systems</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="854" to="863" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Integration of relational databases into OWL knowledge bases: demonstration of the DBOM system</title>
		<author>
			<persName><forename type="first">O</forename><surname>Cure</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-D</forename><surname>Bensaid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 24th International Conference on Data Engineering Workshop</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="230" to="233" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The alignment API 4.0</title>
		<author>
			<persName><forename type="first">J</forename><surname>David</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Euzenat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Scharffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">T</forename><surname>Santos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Semantic Web</title>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Open Source BI Platforms: A Functional and Architectural Comparison</title>
		<author>
			<persName><forename type="first">M</forename><surname>Golfarelli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Data Warehousing and Knowledge Discovery</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">T</forename><surname>Pedersen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Mohania</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Tjoa</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">5691</biblScope>
			<biblScope unit="page" from="287" to="297" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Data integration through database federation</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">M</forename><surname>Haas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">T</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Roth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IBM Systems Journal</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="578" to="596" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Answering queries using views: A survey</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Halevy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The VLDB Journal</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="270" to="294" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Managing semantic heterogeneity in databases</title>
		<author>
			<persName><forename type="first">R</forename><surname>Hull</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems -PODS &apos;97</title>
				<meeting>the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems -PODS &apos;97<address><addrLine>New York, New York, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="51" to="61" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">QIS: A Framework for Biomedical Database Federation</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">E L</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">R</forename><surname>Nadkarni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="523" to="534" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Data integration: A theoretical perspective</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lenzerini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems</title>
				<meeting>the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page">246</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Integration of prostate cancer clinical data using an ontology</title>
		<author>
			<persName><forename type="first">H</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">J</forename><surname>Manion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Goralczyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-N</forename><surname>Wong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Beck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of biomedical informatics</title>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Artificial Intelligence: A Modern Approach</title>
		<author>
			<persName><forename type="first">S</forename><surname>Russell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Norvig</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2003">2003</date>
			<publisher>Pearson Education</publisher>
		</imprint>
	</monogr>
	<note>3rd edition</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Heterogeneous Database Integration in Biomedicine</title>
		<author>
			<persName><forename type="first">W</forename><surname>Sujansky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Biomedical Informatics</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="285" to="298" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Ontology-Based Integration of Information - A Survey of Existing Approaches</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wache</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Voegele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Visser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Stuckenschmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hübner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCAI-01 Workshop: Ontologies and Information Sharing</title>
				<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="108" to="117" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
