<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Query-driven Data Integration (Short Paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Peter K. Schwab</string-name>
          <email>peter.schwab@fau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas M. Wahl</string-name>
          <email>andreas.wahl@fau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Lenz</string-name>
          <email>richard.lenz@fau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klaus Meyer-Wegener</string-name>
          <email>klaus.meyer-wegener@fau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Friedrich-Alexander-Universität Erlangen-Nürnberg, Technische Fakultät, Department Informatik</institution>
          ,
          <addr-line>Lehrstuhl für Informatik 6 (Datenmanagement), Martensstr. 3, 91058 Erlangen, Germany, https://www6.cs.fau.de</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper describes an ongoing project that pursues the idea of query-driven data integration. Instead of first creating a common global schema and then fetching, transforming, and loading the data to be integrated, we start with the queries. They are taken as a specification of information need and thus as the overall purpose of integration. Two repositories are being developed, one for all information related to the queries and one for the potential data sources those queries may refer to. Queries can take very different forms, and thus there are many different ways in which they can be used to make the integration effort more efficient.</p>
      </abstract>
      <kwd-group>
        <kwd>Database</kwd>
        <kwd>Query</kwd>
        <kwd>Data integration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Data integration [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is a task found in many enterprises, causing tremendous
amounts of repeated work. Often, an additional data source is considered
interesting, and thus a complex process of data integration is started. A heavyweight
example is the merger of two companies. Usually, some kind of ETL (“extract
– transform – load”) process is created with great effort, so that the data from
one system can be made available in the other.
      </p>
      <p>
        Since the effort is substantial and in some cases even prohibitive, other
approaches like “pay as you go” have been proposed [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They postpone the
integration effort to the time when the data are actually needed. However, they still
focus on the (incremental) creation of a common global schema.
      </p>
      <p>We continue this line by proposing to look at the queries first. This is our
understanding of the term “query-driven”. It does not mean that we ignore the
other information on the data that may already be available, but the queries are
in the focus. A query is regarded as a specification of information need. Ideally,
it is given in formal notation already, but we accept other forms as well. The
query is written “as if” the new source(s) had already been integrated. So it is
not yet executable. The purpose of data integration is then slightly modified: make
that query executable, and do only the part of the integration process
that is required to achieve that.</p>
      <p>There are many variations and options involved in this scenario: the form of
the query, the annotations or hints pointing to potential data sources, a
repository of data sources found and used in the past, another repository of queries
that have already been used for integration purposes, and many more. The
purpose of this paper is to define some of these steps and to propose methods that
could be used in them. It is assumed that the whole field is by far too large to
be handled in just one project, let alone in one paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Scenarios</title>
      <p>We have been asked to help with data-integration tasks many times. Our insistence
on first naming a few queries caused some confusion in the beginning, because even
the users often had the idea of collecting lots of data first and only then thinking
about possible accesses and analyses. It was not so difficult, however, to convince
them that taking the queries into account would help to focus the development
and save time and resources.</p>
      <p>
        One of these scenarios is the Aroma-Research Database. Here, we can show
an example of a real “query”, see Fig. 1, that was given to us at the very
beginning. The elements of this data sheet all come from different data
sources, some even from remote sites. The details are not important here. So far, Excel
is used to enter the data into the system, and the users would like to
continue doing so. So tools that extract data from Excel tables and format
them appropriately [
        <xref ref-type="bibr" rid="ref10 ref7">7, 10</xref>
        ] must be used. Knowing the queries makes it possible to
restrict the effort to the data that are known to be included in the result.
      </p>
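      <p>As an illustration (our own sketch, not one of the cited tools): once the query's field list is known, extraction from a semi-structured export can be limited to exactly those fields. The field names below are hypothetical, and a CSV export stands in for the Excel table.</p>

```python
import csv
import io

# Hypothetical field set derived from a known query; only these columns
# need to be extracted from the semi-structured source.
NEEDED_FIELDS = {"cas_nr", "molar_mass", "boiling_point"}

def extract_needed(csv_text, needed):
    """Return rows restricted to the columns the query actually uses."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{k: row[k] for k in needed if k in row} for row in reader]

# A CSV stand-in for an Excel data sheet like the one in Fig. 1.
raw = "cas_nr,molar_mass,density,boiling_point\n123-11-5,136.15,1.119,248\n"
rows = extract_needed(raw, NEEDED_FIELDS)
print(rows)
```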
      <p>Fig. 1. A “query” given to us by the users of the Aroma-Research Database (a data sheet for CAS no. 123-11-5, listing properties such as molecular formula, molar mass, density, melting and boiling point, purity, supplier, order number, delivery date, amount, and price, plus a link to the safety data sheet)</p>
      <p>A second scenario is currently called “Walhalla DB”. The purpose is to allow
music researchers to view different aspects of an opera (we use Richard Wagner’s
“Walküre” – Valkyrie – as an example) simultaneously. These are: a video of
some performance, the libretto, and the music sheets. The idea is to allow for
flexible navigation in any of the views and then synchronize the other views.
Here, the queries are defined by the given user interface. It is not yet fixed, but
it already defines a set of accesses to the underlying data sets, which may be calls
of a database API as well as database queries. At the moment, the relational database
has only one table (which is actually the data-entry mode preferred by
the music researchers) and will be re-designed properly on the basis of the
given queries. Images of music sheets are stored separately, and the performance
video is accessed via YouTube. The results of the design, together with the final
queries, will be published in due course.</p>
      <p>
        In co-operation with industrial partners, we often encountered the wish to
associate one’s own data with publicly available data, e. g. weather information or tweets.
It is interesting to see how the weather may have influenced sales numbers, or how
positive or negative tweets had such an effect. This strongly suggests taking
the integration of linked open data [
        <xref ref-type="bibr" rid="ref1 ref8">8, 1</xref>
        ] into account, as has already been
demonstrated by the OPEN project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>So the demand for integration is obvious, and in all these scenarios we
envision a benefit from looking at the queries early.</p>
    </sec>
    <sec id="sec-3">
      <title>Query formulation</title>
      <p>Apart from the world of formal query languages, there is also another world even
more relevant to many users: result tables or reports, as in the scenario of the
Aroma-Research Database. Users can easily sketch a form or report with the
data they would like to combine, each field potentially coming from a different
source. The data sources are often known, but provide only semi-structured
formats (e. g. Excel, PDF, HDFS). They may be located in other departments
or even enterprises.</p>
      <p>
        Since experts must transform these specifications into more formal languages
anyway, the world of query languages must be investigated, too. Here, queries
may now include elements that are not yet found in the given schema as in
DrillBeyond [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. They may be accompanied by hints on the potential sources
for these elements. This triggers a process of searching, downloading,
transformation, and storage, which is quite usual. The difference now is that the query
is already given, so the effort can be tailored to it.
      </p>
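      <p>A minimal sketch of this idea under our own assumptions (not DrillBeyond’s actual mechanism): compare the relations a query references with the tables the current schema provides. A naive regular expression stands in for a real SQL parser.</p>

```python
import re
import sqlite3

def referenced_relations(query):
    """Naively collect table names from FROM and JOIN clauses."""
    return set(m.group(2).lower()
               for m in re.finditer(r"\b(FROM|JOIN)\s+([A-Za-z_]\w*)", query, re.I))

def missing_relations(conn, query):
    """Relations the query uses that the schema does not yet contain."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    known = set(name.lower() for (name,) in rows)
    return referenced_relations(query) - known

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day, amount)")
q = "SELECT s.day, w.temperature FROM sales s JOIN weather w ON s.day = w.day"
print(missing_relations(conn, q))  # the 'weather' relation is not in the schema yet
```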
      <p>We plan to look at different query languages here, but to get started, we will
concentrate on SQL first.</p>
    </sec>
    <sec id="sec-4">
      <title>Integration with Execution</title>
      <p>The degree to which the queries identify data sources may vary. Sometimes the
sources are well-known, as in the cases of the Aroma-Research Database and
the Walhalla DB. Sometimes there is the mere hope that appropriate open data
sources can be found on the World Wide Web. Then searching for data sources
must be the first step. Since this may require significant effort, we propose to keep
the information on data sources that have been found and have proven useful in
answering former queries. Knowing the query may help further, e. g. if it includes
a join with local tables. In this case the join attribute—name as well as type, and
values—can give hints on the characteristics of the data source being sought.</p>
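      <p>The following sketch illustrates this hint mechanism under assumed names: the join attribute and its type are matched against a hypothetical catalog of previously recorded data sources.</p>

```python
# Hypothetical catalog of known data sources and their attribute types.
CATALOG = [
    {"source": "open-weather-csv", "attributes": {"day": "date", "temperature": "float"}},
    {"source": "tweet-archive",    "attributes": {"user": "text", "text": "text"}},
]

def rank_sources(join_attr, join_type, catalog):
    """Score each source by whether it offers the join attribute with a matching type."""
    scored = []
    for entry in catalog:
        score = 0
        if join_attr in entry["attributes"]:
            score += 1  # attribute name matches
            if entry["attributes"][join_attr] == join_type:
                score += 1  # attribute type matches as well
        scored.append((score, entry["source"]))
    return sorted(scored, reverse=True)

# A query joining local sales data on a 'day' attribute of type date:
print(rank_sources("day", "date", CATALOG))
```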
      <p>Once the source has been found, execution of the query may begin, doing
integration on the fly. This can mean fetching data from the source and transforming
them into the format required for further processing. It may as well mean shipping
a sub-query to the source, mostly to select and filter the data before fetching
them. The latter requires that the source can execute these sub-queries—and of
course the knowledge of the query in the first place. It is well known that
this can reduce the amount of data to be transmitted substantially.</p>
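      <p>A small sketch of shipping such a sub-query, with the remote source simulated by a second in-memory SQLite database (an assumption for illustration only):</p>

```python
import sqlite3

# Simulated "remote" source holding the full weather table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE weather (day TEXT, temperature REAL)")
source.executemany("INSERT INTO weather VALUES (?, ?)",
                   [("2009-01-28", 2.5), ("2009-07-01", 28.0), ("2009-07-02", 31.5)])

# Only the predicate relevant to the overall query is shipped to the source,
# so only matching rows are transmitted.
subquery = "SELECT day, temperature FROM weather WHERE temperature > ?"
fetched = source.execute(subquery, (25.0,)).fetchall()

# The local system materializes just the filtered rows for further processing.
local = sqlite3.connect(":memory:")
local.execute("CREATE TABLE weather (day TEXT, temperature REAL)")
local.executemany("INSERT INTO weather VALUES (?, ?)", fetched)
print(len(fetched))  # 2 of 3 rows transmitted
```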
      <p>In the end, the data coming from the source need to be transformed into the
format required. Again, we see a benefit in having the query at hand and not just
creating some part of a common (global) schema, because we transform exactly
those parts that will be included in the final processing or the query result.</p>
      <p>As far as the query result contributes to the construction of a global schema,
it may of course be used for that, too. This, however, is done after the execution,
so that it does not postpone the delivery.</p>
    </sec>
    <sec id="sec-5">
      <title>Repository</title>
      <p>We are currently designing a repository for queries and any information related
to them. (It is understood that this design takes the queries to be executed on the
repository into account.) SQL as a query language will be dominant in the beginning, but we
try to remain open to other query languages and forms. Not surprisingly, the
amount of information associated with a single query can be quite substantial.
It includes:
– the query expression in text form,
– a query specification in the form of example results,
– the abstract syntax tree of the query,
– the relevant parts of the query as separate features (relations used, attributes
returned, selection predicates, grouping, etc.),
– a summary of the query result,
– the frequency of query execution,
– the importance of query execution (some kind of “emergency” query may be very
rare, yet very important once executed),
– a reference to the software that invokes the query,
– the execution cost or time of the query,
– the query execution plan in some system at hand,
– and many others.
The repository will be designed with evolution in mind, that is, beginning with
a rather small subset of attributes and features, but with the potential to grow.</p>
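      <p>A minimal sketch of such a repository in SQL; the table and column names are our own illustration of the attributes listed above, not a finished design.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE query (
    id              INTEGER PRIMARY KEY,
    expression      TEXT,      -- the query in text form
    example_result  TEXT,      -- specification in the form of example results
    ast             TEXT,      -- serialized abstract syntax tree
    frequency       INTEGER,   -- how often the query is executed
    importance      INTEGER,   -- rare "emergency" queries may still rank high
    invoking_app    TEXT,      -- software that invokes the query
    cost_ms         REAL       -- observed execution cost or time
);
CREATE TABLE query_feature (   -- relevant query parts as separate features
    query_id INTEGER REFERENCES query(id),
    kind     TEXT,             -- e.g. 'relation', 'attribute', 'predicate'
    value    TEXT
);
""")
conn.execute("INSERT INTO query (id, expression, frequency) VALUES (1, 'SELECT ...', 12)")
conn.execute("INSERT INTO query_feature VALUES (1, 'relation', 'weather')")
print(conn.execute("SELECT COUNT(*) FROM query").fetchone()[0])
```

The design is deliberately a small subset of the listed attributes; further columns and feature kinds can be added as the repository evolves.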
      <p>The other repository keeps track of the data sources. It makes sense to
remember the data sources that have already been found, as well as the experience
gained with them. Their usefulness in at least some context should be recorded. This
includes a notion of trust and reliability. Also, the mapping and transformation of
data elements from such a source should be kept.</p>
      <p>Both repositories will be linked. This makes it possible to find the data sources that
have previously been used by a query. Still, the user may decide to replace them
with better sources found in the meantime. It also makes it possible to find all the queries
that have used a given data source in the past. Their owners may be informed
about changes in that source, up to the fact that it may no longer be available.</p>
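      <p>The link between the two repositories can be sketched as follows; all table and column names are hypothetical.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE query  (id INTEGER PRIMARY KEY, expression TEXT, owner TEXT);
CREATE TABLE source (id INTEGER PRIMARY KEY, url TEXT, available INTEGER);
CREATE TABLE query_uses_source (query_id INTEGER, source_id INTEGER);
""")
conn.execute("INSERT INTO query VALUES (1, 'SELECT ...', 'alice')")
conn.execute("INSERT INTO query VALUES (2, 'SELECT ...', 'bob')")
conn.execute("INSERT INTO source VALUES (1, 'https://example.org/weather.csv', 0)")
conn.executemany("INSERT INTO query_uses_source VALUES (?, ?)", [(1, 1), (2, 1)])

# Given a source that is no longer available, find the owners of all queries
# that have used it, so they can be informed.
owners = [row[0] for row in conn.execute("""
    SELECT q.owner FROM query q
    JOIN query_uses_source qs ON qs.query_id = q.id
    JOIN source s ON s.id = qs.source_id
    WHERE s.available = 0 ORDER BY q.owner
""")]
print(owners)
```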
    </sec>
    <sec id="sec-6">
      <title>Related Work</title>
      <p>
        The literature on data integration is vast. The book by Doan et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] may
be considered the standard work now. Hardly any of the approaches, however,
are query-driven. Most of them focus on the construction of a global schema;
not necessarily before using the system, but at least before executing a query.
      </p>
      <p>
        The work on incremental integration is also helpful when integrating in a
query-driven way. The Data Tamer system [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] (now a product named “Tamr”) includes
many techniques for that. The already-mentioned systems OPEN and
DrillBeyond [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ] are query-driven to some extent, and provide very useful mechanisms
for the integration of external data sources, which can be adapted to work with
our approach. Query-driven schema expansion as described in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is similar
to our approach, but concentrates on a rather specific form of external data,
namely ratings based on crowdsourcing. The example used is the rating of
movies. Database reverse engineering and SQL tracking [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] can be used to feed a
query repository. Collaborative query management [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] also creates a repository,
but for a different purpose: users are supported in writing queries by searching
for similar queries.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Summary and Outlook</title>
      <p>The impetus of the project described in this paper is to take queries into account
when designing a data management system that integrates heterogeneous data
sources. This has many aspects that still need to be investigated in more detail.
Queries can be specified in many different ways with different properties, and
their execution can be prepared and finally carried out along many different paths. It
is our belief that it is worthwhile to follow at least some of them.</p>
      <p>
        We have already taken a first step as documented in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The repository can
be initialized with the help of query logs. The article proposes a particular
approach to evaluate query logs in the context of data integration. The information
required to do this will also be available in the repository sketched here. The
query log is used to extract knowledge about data sources and thus contributes
to the contents of the second repository. This knowledge can then be utilized by
the data scientists interacting with the data-integration system. The queries may
contain fictional tables and attributes, and the system helps to find appropriate
data sources. This generates a kind of kernel for the first repository.
      </p>
      <p>Please note that it is not necessary to take a purist view. The query-driven
approach can easily be combined with the classical approach (which we may now
call schema-driven). The message here is simply to consider the queries as well.
Whatever knowledge about the data and their preliminary structure is available,
it should certainly not be ignored.</p>
      <p>Acknowledgement: The authors would like to thank the anonymous reviewers for their
valuable remarks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked data - the story so far</article-title>
          .
          <source>Int. Journal on Semantic Web &amp; Information Systems</source>
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Braunschweig</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eberius</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thiele</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehner</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>OPEN - enabling nonexpert users to extract, integrate, and analyze open data</article-title>
          .
          <source>Datenbank-Spektrum</source>
          <volume>12</volume>
          (
          <issue>2</issue>
          ),
          <fpage>121</fpage>
          -
          <lpage>130</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. van den Brink, H.J., van der Leek, R.C.:
          <article-title>Quality metrics for SQL queries embedded in host languages</article-title>
          .
          <source>In: Proc. Special Session on System Quality and Maintainability (SQM), in conjunction with the 11th European Conf. on Software Maintenance and Reengineering (CSMR), Amsterdam, the Netherlands, March 21-23</source>
          . p. 2 (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Das Sarma</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Bootstrapping pay-as-you-go data integration systems</article-title>
          .
          <source>In: Proc. SIGMOD</source>
          . pp.
          <fpage>861</fpage>
          -
          <lpage>874</lpage>
          . ACM, New York, NY, USA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Principles of Data Integration</article-title>
          . Morgan Kaufmann, Waltham, MA, USA (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Eberius</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thiele</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Braunschweig</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehner</surname>
            ,
            <given-names>W.:</given-names>
          </string-name>
          <article-title>DrillBeyond: Processing multi-result open world SQL queries</article-title>
          .
          <source>In: Proc. 27th Int. Conf. on SSDBM</source>
          . pp.
          <volume>16</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          :
          <fpage>12</fpage>
          . ACM (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Eberius</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Werner</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thiele</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Braunschweig</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dannecker</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehner</surname>
            ,
            <given-names>W.:</given-names>
          </string-name>
          <article-title>DeExcelerator: a framework for extracting relational data from partially structured documents</article-title>
          .
          <source>In: Proc. CIKM</source>
          . pp.
          <fpage>2477</fpage>
          -
          <lpage>2480</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Linked Data: Evolving the Web into a Global Data Space</article-title>
          .
          <source>Synthesis Lectures on the Semantic Web: Theory and Technology</source>
          , Morgan &amp; Claypool (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Khoussainova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balazinska</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gatterbauer</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwon</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suciu</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
          <article-title>A case for a collaborative query management system</article-title>
          .
          <source>In: Proc. CIDR</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulwani</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>FlashExtract: A framework for data extraction by examples</article-title>
          .
          <source>In: Proc. PLDI</source>
          (Edinburgh, United Kingdom, June 9-11). pp.
          <fpage>542</fpage>
          -
          <lpage>553</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Selke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lofi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balke</surname>
            ,
            <given-names>W.T.</given-names>
          </string-name>
          :
          <article-title>Pushing the boundaries of crowd-enabled databases with query-driven schema expansion</article-title>
          .
          <source>PVLDB</source>
          <volume>5</volume>
          (
          <issue>6</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>549</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Stonebraker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bruckner</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beskales</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherniack</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zdonik</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pagan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Data curation at scale: The Data Tamer system</article-title>
          .
          <source>In: Proc. CIDR</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Wahl</surname>
            ,
            <given-names>A.M.:</given-names>
          </string-name>
          <article-title>A minimally-intrusive approach for query-driven data integration systems</article-title>
          .
          <source>In: Proc. IEEE 32nd ICDE Workshops (ICDEW)</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>