=Paper=
{{Paper
|id=Vol-1670/paper-48
|storemode=property
|title=Query-driven Data Integration (Short Paper)
|pdfUrl=https://ceur-ws.org/Vol-1670/paper-48.pdf
|volume=Vol-1670
|authors=Peter Schwab,Andreas Wahl,Richard Lenz,Klaus Meyer-Wegener
|dblpUrl=https://dblp.org/rec/conf/lwa/SchwabWLM16
}}
==Query-driven Data Integration (Short Paper)==
Query-driven Data Integration (Short Paper)

Peter K. Schwab, Andreas M. Wahl, Richard Lenz, and Klaus Meyer-Wegener

Friedrich-Alexander-Universität Erlangen-Nürnberg, Technische Fakultät, Department Informatik, Lehrstuhl für Informatik 6 (Datenmanagement), Martensstr. 3, 91058 Erlangen, Germany
{peter.schwab,andreas.wahl,richard.lenz,klaus.meyer-wegener}@fau.de
https://www6.cs.fau.de

Abstract. The paper describes an ongoing project that pursues the idea of query-driven data integration. Instead of first creating a common global schema and fetching, transforming, and loading the data to be integrated, we start with the queries. They are taken as a specification of information need and thus as the overall purpose of integration. Two repositories are being developed, one for all information related to the queries and one for potential data sources to which those queries may refer. Queries may have very different forms, and thus there are many different ways in which they can be used to make the integration effort more efficient.

Keywords: Database, Query, Data integration

1 Introduction

Data integration [5] is a task found in many enterprises, causing tremendous amounts of repeated work. Often, an additional data source is considered interesting, and thus a complex process of data integration is started. A heavy-weight example is the merger of two companies. Usually, some kind of ETL ("extract – transform – load") process is created with great effort, so that the data from one system can be made available in the other.

Since the effort is substantial and in some cases even prohibitive, other approaches like "pay as you go" have been proposed [4]. They postpone the integration effort to the time when the data are actually needed. However, they still focus on the (incremental) creation of a common global schema.

We continue this line by proposing to look at the queries first. This is our understanding of the term "query-driven". It does not mean that we ignore the other information on the data that may already be available, but the queries are in the focus. A query is regarded as a specification of information need. Ideally, it is given in formal notation already, but we accept other forms as well. The query is written "as if" the new source(s) had already been integrated. So it is not yet executable. The purpose of data integration is then slightly modified to making that query executable, and to doing only the part of the integration process that is required to achieve that.

There are many variations and options involved in this scenario: the form of the query, the annotations or hints pointing to potential data sources, a repository of data sources found and used in the past, another repository of queries that have already been used for integration purposes, and many more. The purpose of this paper is to define some of these steps and to propose methods that could be used in them. It is assumed that the whole field is by far too large to be handled in just one project, let alone in one paper.
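To make the "as if" idea concrete, consider the following minimal sketch. It is our own illustration with hypothetical names (loosely anticipating the weather scenario in Section 2), not an example from the project: a query that references a table which has not been integrated yet and is therefore not executable.

```python
# Hypothetical illustration: a query written "as if" the external source
# were already integrated. The table weather_history does not exist in the
# local schema yet, so the query is not executable; making it executable
# defines exactly how much integration work is needed.
query = """
SELECT s.store_id,
       SUM(s.revenue)    AS total_revenue,
       w.avg_temperature
FROM   sales s
JOIN   weather_history w            -- not yet integrated
  ON   s.sale_date = w.obs_date AND s.city = w.city
GROUP  BY s.store_id, w.avg_temperature
"""

# The integration task is reduced to finding a source that can provide
# weather_history(obs_date, city, avg_temperature) and transforming only
# the attributes this query actually uses.
print(query)
```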
2 Scenarios

We have been asked to help in data-integration tasks many times. Our insistence on first naming a few queries caused some confusion in the beginning, because the users often had the idea to collect lots of data first and only then to think about possible accesses and analyses. It was not so difficult, however, to convince them that taking the queries into account would help to focus the development and save time and resources.

One of these scenarios is the Aroma-Research Database. Here, we can show an example of a real "query", see Fig. 1, that was given to us in the very beginning. The elements of this data sheet all come from different data sources, some even from remote sites. The details are not important here. Excel is used so far to enter the data into the system, and the users would like to continue to do so. So tools extracting data from Excel tables and formatting them appropriately [7, 10] must be used; a small sketch follows at the end of this section. The knowledge of the queries allows us to restrict the effort to the data that are known to be included in the result.

[Fig. 1. A "query" given to us by the users of the Aroma-Research Database: a data sheet for the aroma compound p-Anisaldehyde, combining solutions, retention indices, chemical properties, quality-assurance results, and supplier information from different sources.]

A second scenario is currently called "Walhalla DB". The purpose is to allow music researchers to view different aspects of an opera (we use Richard Wagner's "Walküre" – Valkyrie – as an example) simultaneously. These are: a video of some performance, the libretto, and the music sheets. The idea is to allow for flexible navigation in any of the views, and then to synchronize the other views. Here, the queries are defined by the given user interface. It is still not fixed, but already defines a set of accesses to the underlying data sets, which may be calls of a database API as well as database queries. The relational database at the moment has only one table (which is actually the data-entry mode preferred by the music researchers) and will be re-designed in a proper way on the basis of the given queries. Images of music sheets are stored separately, and the performance video is accessed via YouTube. The results of the design together with the final queries will be published in due course.

In co-operations with industrial partners, we often found the wish to associate their own data with publicly available data, e.g. weather information or tweets. It is interesting how the weather may have influenced sales numbers, or how positive or negative tweets had such an effect. This strongly suggests taking the integration of linked open data [8, 1] into account, as has already been demonstrated by the OPEN project [2].

So the demand for integration is obvious, and in all these scenarios we envision a benefit from looking at the queries early.
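As an illustration of how knowing the query restricts the extraction effort in the Aroma-Research scenario: if the query result is known to contain only a few attributes, an extraction tool only needs to pull those columns from the Excel sheets. The following is a hedged sketch using pandas; the file name and column labels are assumptions, not the actual data sheets.

```python
import pandas as pd

# Sketch (hypothetical file and column names): the known query needs only
# these attributes, so extraction from the semi-structured Excel source
# can be restricted to them instead of loading the whole data sheet.
needed_columns = ["CAS-Nr.", "Konz.", "Standort"]

# pandas reads just the required columns from the sheet.
frame = pd.read_excel("aroma_data_sheets.xlsx", usecols=needed_columns)
print(frame.head())
```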
3 Query formulation

Apart from the world of formal query languages, there is also another world even more relevant to many users: result tables or reports, as in the scenario of the Aroma-Research Database. Users can easily sketch a form or report with the data they would like to combine, each field potentially coming from a different source. The data sources are often known, but provide only semi-structured formats (e.g. Excel, PDF, HDFS). They may be located in other departments or even other enterprises.

Since experts must transform these specifications into more formal languages anyway, the world of query languages must be investigated, too. Here, queries may now include elements that are not yet found in the given schema, as in DrillBeyond [6]. They may be accompanied by hints on the potential sources for these elements. This triggers a process of searching, downloading, transformation, and storage, which is quite usual. The difference now is that the query is already given, so the effort can be tailored to it. We plan to look at different query languages here, but to get started, we will concentrate on SQL first.

4 Integration with Execution

The degree to which the queries identify data sources may vary. Sometimes the sources are well-known, as in the cases of the Aroma-Research Database and the Walhalla DB. Sometimes there is the mere hope that appropriate open data sources can be found on the World Wide Web. Then searching for data sources must be the first step. Since this may be a significant effort, we propose to keep the information on data sources that have been found and have proven useful in answering former queries. Knowing the query may help further, e.g. if it includes a join with local tables. In this case the join attribute (its name as well as its type and values) can give hints on the characteristics of the data source being sought.

Once the source has been found, execution of the query may begin, doing integration on the fly. This can mean fetching data from the source and transforming them to the format required for further processing. It may as well mean shipping some sub-query to the source, mostly to select and filter the data before fetching them (see the sketch at the end of this section). The latter requires that the source can execute these sub-queries, and of course the knowledge of the query in the first place. It is a commonplace that this can reduce the amount of data to be transmitted substantially.

In the end, the data coming from the source need to be transformed to the format required. Again, we see a benefit in having the query at hand and not just creating some part of a common (global) schema, because we transform exactly those parts that will be included in the final processing or the query result. As far as the query result contributes to the construction of a global schema, it may of course be used for that, too. This, however, is done after the execution, so that it does not postpone the delivery.
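Shipping a selective sub-query to the source, as opposed to fetching everything and filtering locally, can be shown in a small self-contained sketch. This is our own hedged example, not the project's implementation: an in-memory SQLite database stands in for the remote source, and all table and column names are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the remote source; a real
# source would be reached through its own driver or API.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE measurements (substance TEXT, standort TEXT, konz REAL)")
source.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("p-Anisaldehyde", "ER", 2033.5), ("p-Anisaldehyde", "FR", 1789.0)],
)

# Option 1: fetch everything, then select and transform locally.
# All rows are transmitted, even those the query discards.
all_rows = source.execute("SELECT * FROM measurements").fetchall()
filtered_locally = [row for row in all_rows if row[1] == "ER"]

# Option 2: ship the selection as a sub-query to the source. Only the
# matching rows travel, which can reduce the transferred volume
# substantially. This requires that the source can execute the sub-query,
# and that the query is known up front, which is exactly what the
# query-driven approach provides.
shipped = source.execute(
    "SELECT * FROM measurements WHERE standort = ?", ("ER",)
).fetchall()

assert filtered_locally == shipped
```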
5 Repository

We are currently designing a repository for queries and any information related to them¹. SQL as a query language will be dominant in the beginning, but we try to remain open for other query languages and forms. Not surprisingly, the amount of information associated with a single query can be quite substantial; a sketch of a possible entry structure follows at the end of this section. It includes:

– the query expression in text form,
– a query specification in form of example results,
– the abstract syntax tree of the query,
– the relevant parts of the query as separate features (relations used, attributes returned, selection predicates, grouping, etc.),
– a summary of the query result,
– frequency of query execution,
– importance of query execution²,
– reference to the software that invokes the query,
– execution cost or time of the query,
– the query execution plan in some system at hand,
– and many others.

¹ It is understood that this design takes the queries to be executed on the repository into account.
² Some kind of "emergency" query may be very rare, yet very important once executed.

The repository will be designed with evolution in mind, that is, beginning with a rather small subset of attributes and features, but with a potential to grow.

The other repository keeps track of the data sources. It makes sense to remember the data sources that have already been found, as well as the experience with them. Their usefulness in at least some context should be recorded. This includes a notion of trust and reliability. Also, the mapping and transformation of data elements from that source should be kept.

Both repositories will be linked. This allows us to find the data sources that have previously been used by a query. Still, the user may decide to replace them with better sources found in the meantime. And it allows us to find all the queries that have used a given data source in the past. Their owners may be informed about changes in that source, up to the fact that it may no longer be available.
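To make the list above more tangible, the following is a minimal sketch of what a single entry in the query repository might look like, written as a Python dataclass. The field names and types are our own assumptions, not the project's actual design; in line with the evolution-minded design described above, it starts with a small subset of attributes that can grow over time.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hedged sketch of one query-repository entry, mirroring the list of
# associated information above. All field names and types are assumptions.
@dataclass
class QueryRepositoryEntry:
    query_text: str                          # the query expression in text form
    example_result: Optional[str] = None     # specification in form of example results
    syntax_tree: Optional[dict] = None       # abstract syntax tree of the query
    relations_used: list[str] = field(default_factory=list)
    attributes_returned: list[str] = field(default_factory=list)
    selection_predicates: list[str] = field(default_factory=list)
    result_summary: Optional[str] = None     # summary of the query result
    execution_frequency: int = 0             # how often the query is executed
    importance: float = 0.0                  # rare "emergency" queries may still rank high
    invoking_software: Optional[str] = None  # reference to the calling software
    execution_cost_ms: Optional[float] = None
    execution_plan: Optional[str] = None     # plan in some system at hand

# Example: a minimal entry that can be enriched as more information arrives.
entry = QueryRepositoryEntry(query_text="SELECT * FROM measurements")
```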
6 Related Work

The literature on data integration is tremendous. The book by Doan et al. [5] may be considered the standard work now. Hardly any of the approaches, however, are query-driven. Most of them focus on the construction of a global schema; not necessarily before using the system, but at least before executing a query.

The work on incremental integration is also helpful when doing it query-driven. The Data Tamer system [12] (now a product named "Tamr") includes many techniques for that. The already-mentioned systems OPEN and DrillBeyond [2, 6] are query-driven to some extent, and provide very useful mechanisms for the integration of external data sources, which can be adapted to work with our approach. Query-driven schema expansion as described in [11] is similar to our approach, but concentrates on a rather specific form of external data, namely ratings based on crowd sourcing. The example used is the rating of movies.

Database reverse engineering and SQL tracking [3] can be used to feed a query repository. Collaborative query management [9] also creates a repository, but for a different purpose: users are supported in writing queries by searching for similar queries.

7 Summary and Outlook

The impetus of the project described in this paper is to take queries into account when designing a data management system that integrates heterogeneous data sources. This has many aspects that still need to be investigated in more detail. Queries can be specified in many different ways with different properties, and their execution can be prepared and finally done along many different paths. It is our belief that it is worthwhile to follow at least some of them.

We have already taken a first step as documented in [13]. The repository can be initialized with the help of query logs. That article proposes a particular approach to evaluate query logs in the context of data integration. The information required to do this will also be available in the repository sketched here. The query log is used to extract knowledge about data sources and thus contributes to the contents of the second repository. This knowledge can then be utilized by the data scientists interacting with the data-integration system. The queries may contain fictional tables and attributes, and the system helps to find appropriate data sources. This generates a kind of kernel for the first repository.

Please note that it is not necessary to take a puristic view. The query-driven approach can easily be combined with the classical approach (which we may now call schema-driven). The message here is simply to consider the queries as well. Whatever knowledge about the data and their preliminary structure is available, it should certainly not be ignored.

Acknowledgement: The authors would like to thank the anonymous reviewers for their valuable remarks.

References

1. Bizer, C., Heath, T., Berners-Lee, T.: Linked data – the story so far. Int. Journal on Semantic Web & Information Systems 5(3), 1–22 (2009)
2. Braunschweig, K., Eberius, J., Thiele, M., Lehner, W.: OPEN – enabling non-expert users to extract, integrate, and analyze open data. Datenbank-Spektrum 12(2), 121–130 (2012)
3. van den Brink, H.J., van der Leek, R.C.: Quality metrics for SQL queries embedded in host languages. In: Proc. Special Session on System Quality and Maintainability (SQM, March 20) in conjunction with the 11th European Conf. on Software Maintenance and Reengineering: "Software Evolution in Complex Software Intensive Systems" (CSMR, Amsterdam, the Netherlands, March 21-23). p. 2 (2007)
4. Das Sarma, A., Dong, X., Halevy, A.: Bootstrapping pay-as-you-go data integration systems. In: Proc. SIGMOD. pp. 861–874. ACM, New York, NY, USA (2008)
5. Doan, A., Halevy, A., Ives, Z.: Principles of Data Integration. Morgan Kaufmann, Waltham, MA, USA (2012)
6. Eberius, J., Thiele, M., Braunschweig, K., Lehner, W.: DrillBeyond: Processing multi-result open world SQL queries. In: Proc. 27th Int. Conf. on SSDBM. pp. 16:1–16:12. ACM (2015)
7. Eberius, J., Werner, C., Thiele, M., Braunschweig, K., Dannecker, L., Lehner, W.: DeExcelerator: A framework for extracting relational data from partially structured documents. In: Proc. CIKM. pp. 2477–2480 (2013)
8. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool (2011)
9. Khoussainova, N., Balazinska, M., Gatterbauer, W., Kwon, Y., Suciu, D.: A case for a collaborative query management system. In: Proc. CIDR (2009)
10. Le, V., Gulwani, S.: FlashExtract: A framework for data extraction by examples. In: Proc. PLDI (Edinburgh, United Kingdom, June 9-11). pp. 542–553 (2014)
11. Selke, J., Lofi, C., Balke, W.T.: Pushing the boundaries of crowd-enabled databases with query-driven schema expansion. PVLDB 5(6), 538–549 (2012)
12. Stonebraker, M., Bruckner, D., Ilyas, I.F., Beskales, G., Cherniack, M., Zdonik, S.B., Pagan, A., Xu, S.: Data curation at scale: The Data Tamer system. In: Proc. CIDR (2013)
13. Wahl, A.M.: A minimally-intrusive approach for query-driven data integration systems. In: Proc. IEEE 32nd ICDE Workshops (ICDEW) (2016)