=Paper=
{{Paper
|id=Vol-3135/darliap_paper5
|storemode=property
|title=Intelligent integration of heterogeneous data for answering analytics queries in multi-cloud environments
|pdfUrl=https://ceur-ws.org/Vol-3135/darliap_paper5.pdf
|volume=Vol-3135
|authors=Genoveva Vargas-Solar,Chirine Ghedira-Guegan,Nadia Bennani
|dblpUrl=https://dblp.org/rec/conf/edbt/Vargas-SolarGB22
}}
==Intelligent integration of heterogeneous data for answering analytics queries in multi-cloud environments==
Genoveva Vargas-Solar (1), Chirine Ghedira-Guégan (2) and Nadia Bennani (3)
(1) CNRS, Univ Lyon, INSA Lyon, UCBL, LIRIS, UMR5205, F-69221 Villeurbanne, France
(2) Univ Lyon, Université Jean Moulin Lyon 3, LIRIS, UMR5205, iaelyon School of Management, France
(3) Univ Lyon, INSA Lyon, CNRS, UCBL, Centrale Lyon, Univ Lyon 2, LIRIS, UMR5205, F-69621 Villeurbanne, France
Abstract
This position paper discusses the design of an approach to enable trusted data integration in a multi-cloud environment in
the presence of large, heterogeneous data sources. The approach relies on mechanisms that compute trust in data and its
providers by applying statistical and probabilistic methods to data provenance. The result is a solution that expresses analytics
queries as service coordinations that can be enacted in multi-cloud settings. The paper describes the associated challenges
and possible ways of addressing them. The approach and challenges are based on concrete requirements stemming from a
medical scenario related to understanding, modelling, and predicting patients’ conditions associated with sleep apnoea.
Keywords
Intelligent data integration, data analytics queries, multi-cloud, data services, data-driven e-health
Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK.
Contact: genoveva.vargas-solar@cnrs.fr (G. Vargas-Solar); chirine.ghedira-guegan@univ-lyon3.fr (C. Ghedira-Guégan); nadia.bennani@insa-lyon.fr (N. Bennani)
http://www.vargas-solar.com (G. Vargas-Solar)
ORCID: 0000-0001-9545-1821 (G. Vargas-Solar)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
1. Introduction
The digitalization of companies implies an explosion of data to be integrated and analyzed to answer different types of questions. Over the past decade, service-oriented environments have made it easier for many users to access data through data services. A data service is a software entity accessible through APIs that describe the methods that provide data either on-demand or continuously.
Companies willing to make data-driven decisions currently use services and process large amounts of data and resources. However, these companies have to deal with the efficiency and cost of processing linked to this avalanche of complex multi-source data deployed on several clouds for economic reasons. In this context, several economic actors are led to cross-reference this voluminous data through data integration, guided by queries that specify the data required by an application, a user, or a community of users.
For example, in the e-health context, there can be services providing information about people's physiological metrics, apnoea events during sleeping hours, daily glucose measures, and dietary and training session information. Through these data services, applications can, for example, ask: What training programs has the patient Alice performed? What is the number of apnoea intervals of my patients during the last 10 days? What is the number of apnoea intervals per night when the patient Bob slept 5 hours and had a glucose level of 200 two hours before going to sleep?
These queries can be answered by composing on-demand and streaming data services. Beyond the methods described by the APIs, the services are tagged with QoS measures such as data freshness, response time, and execution price, among others. As a result, the data services composition includes a combination of QoS preferences that can be interpreted as constraints.
The notion of service level agreement (SLA) can define the agreement between the users and the set of services used. From this contract, the user’s preferences and requirements emerge. For example, a user can express preferences on:
1. the freshness of the data used to answer her/his query (e.g., data at most one month old);
2. the availability of the service;
3. the latency of data delivery, constrained to milliseconds.
In this context, the challenges and questions that drive our work are:
• What models, mechanisms, and algorithms are suitable for data integration when qualitative and quantitative criteria guide it?
• What are the criteria, constraints and requirements on the data that guide query evaluation and data integration across multiple providers?
• When we invoke machine intelligence, what intelligence mechanisms should be considered for making integration intelligent?
• Finally, does the cloud bring specific challenges when making data integration intelligent?
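The QoS preferences mentioned above can be read as simple bounds on the QoS tags of each service. The following Python fragment is a minimal sketch under that assumption; the class names, field names and units are hypothetical and are not part of the ADEL platform or of any SLA standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataService:
    """A data service tagged with QoS measures (hypothetical field names)."""
    name: str
    freshness_days: float  # age of the most recent data the service delivers
    availability: float    # fraction of time the service is reachable, in [0, 1]
    latency_ms: float      # typical delay of data delivery


@dataclass(frozen=True)
class SLAPreferences:
    """User preferences interpreted as bounds on the QoS tags."""
    max_freshness_days: float
    min_availability: float
    max_latency_ms: float


def satisfies(service: DataService, sla: SLAPreferences) -> bool:
    """Return True when every QoS tag of the service respects the SLA bounds."""
    return (service.freshness_days <= sla.max_freshness_days
            and service.availability >= sla.min_availability
            and service.latency_ms <= sla.max_latency_ms)


def candidate_services(services, sla):
    """Keep only the services that can take part in a composition under this SLA."""
    return [s for s in services if satisfies(s, sla)]
```

Treating each preference as a hard bound is the simplest reading; preferences could equally be weighted and traded off against each other, which this sketch does not attempt.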
This position paper discusses the interest of treating analytics queries as trustworthy service-based coordinations and considering them as first-class citizens. We introduce an approach to answer analytics queries on medical data by building trustworthy data integration. The approach considers that data is provided by services deployed on different cloud providers. Therefore, the paper highlights the challenges of running service-based analytics queries in a multi-cloud environment. The approach uses mechanisms to infer trust levels in data and providers by applying statistical and probabilistic methods.
Accordingly, the remainder of the paper is organized as follows. Section 2 discusses related work regarding data distribution on services and service-based querying approaches. Section 3 describes the solution proposed for implementing data analytics on apnoea conditions. Section 4 illustrates use cases addressed through data-centred strategies that use machine learning and data analytics algorithms. Section 5 concludes the paper and discusses future work.
2. Related Work
The classical view of data integration has been widely addressed in databases through data model equivalence and transformation, schema integration, and query rewriting algorithms. The main feature of these approaches is the prior knowledge of the data sources when performing the integration process. The emergence of data services has revolutionised the problem of data integration because services provide data whose sources and format are not known. Data integration with data provider services starts with a query expressing the data requirements and searches for services that can provide them. Also, in the approaches [1, 2, 3], the services export their APIs (Application Programming Interfaces) and data according to a pivotal model that can be used to integrate the results. Thus, the problem of data integration becomes a rewriting problem where queries are rewritten using matching and service composition mechanisms [4]. The problem is more challenging because of the diversity of data requesting and consuming devices and the absence of meta-data (e.g., veracity, freshness) specifying the associated conditions of data provision and use (user preferences). Service guarantees are established through service level agreements (SLAs) between producers and consumers. These agreements have the advantage of guiding the integration process by pruning the data, but they significantly increase the complexity of the process, especially in the presence of a large number (of the order of several thousand) of data sources.
Data integration [5] is driven by queries that specify the data required by an application, a user or a community of users. The classical view of data integration, where data sources are known in advance, has been widely addressed in databases: data model equivalence and transformation, schema integration, and query rewriting algorithms. In data integration in the presence of services that act as data providers, the starting point is a query that expresses needs in terms of the data required, and the integration process must search for services that can meet these needs. In the approaches [1, 2, 3], the services export their APIs (Application Programming Interfaces) and data in a pivotal model that can be used to integrate results. Thus, the data integration problem becomes a query rewriting problem using matching and service composition mechanisms [4].
With the evolution of technology, queries are issued from multiple devices with different constraints, and results are consumed under other conditions (energy consumption, network bandwidth consumption, economic cost, privacy, trust and criticality). Data producers do not export the properties of their data and the conditions under which it is delivered. Consumers express the expected quality of data [6] and the conditions under which it will be consumed, such as its veracity, freshness, etc. These qualitative aspects necessary for data consumption and the conditions under which queries are to be evaluated are specified through contracts (SLAs) between data producers and data consumers and through user profiles informing about data usage preferences. These specifications have the advantage of guiding and pruning data in the integration process. Still, they add significant complexity to an already complex process, especially in the case of queries using a large number of data sources. Indeed, evaluating a query (i.e. rewriting a query) becomes a combinatorial problem whose complexity increases with the expression of quality requirements, which are non-orthogonal constraints.
Heuristics and "best-effort" approaches have already been proposed in the fields of databases [4] and services (SOA) [7]. Given the heterogeneous context of multi-cloud, multi-device and multi-objective Internet of Things (IoT), the primary challenge is to deliver responses to user requests in a reasonable time and at an acceptable cost, given the complexity of the rewriting process. Furthermore, guaranteeing the quality of integrated data requires considering the source of the data and the level of trust in the services that provide it. The idea is to capitalise on the execution history of requests by proposing an intelligent process that can learn from previous integration experiences. The difficulty would then lie in the diversity of the execution contexts of each request (deployed services, expressed needs, critical situation, required level of confidence, response time, etc.).
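To illustrate why rewriting becomes combinatorial, the following Python sketch enumerates, by brute force, the minimal sets of services that together provide the attributes a query requires while respecting a single quality bound (total response time). It is an illustration only, not the Rhone algorithm [6] nor any of the cited approaches; the catalogue format, the attribute names and the additive response-time model are assumptions.

```python
from itertools import combinations


def rewritings(required, services, max_total_ms):
    """Enumerate minimal service sets covering `required` within the time bound.

    `services` maps a service name to (set of provided attributes, response time in ms).
    """
    names = sorted(services)
    found = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Skip supersets of an accepted rewriting so only minimal sets are kept.
            if any(set(prev) <= set(combo) for prev in found):
                continue
            covered = set().union(*(services[n][0] for n in combo))
            total_ms = sum(services[n][1] for n in combo)
            if required <= covered and total_ms <= max_total_ms:
                found.append(combo)
    return found


# Hypothetical catalogue of data services.
CATALOGUE = {
    "cpap_a": ({"patient", "ahi"}, 40),
    "cpap_b": ({"patient", "ahi", "leaks"}, 90),
    "records": ({"patient", "epworth"}, 60),
    "pharmacy": ({"patient", "treatment"}, 30),
}
query_attributes = {"patient", "ahi", "epworth"}
plans = rewritings(query_attributes, CATALOGUE, max_total_ms=120)
```

With n candidate services the loop can inspect up to 2^n - 1 subsets, and each additional quality requirement adds another test per subset, which is why heuristic and best-effort strategies are needed when data sources number in the thousands.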
Figure 1: ADEL servitization on the cloud. (Diagram labels: Patient health record, Pharmacist, Health-care provider, Manufacturer, CPAP, OSF (Sleep Observatory of the Federation of Pneumology), and, at the centre, Rhône, SLA Observer and Registry / Services Classifier.)
3. Trusted Apnoea Data Integration for Answering Analytics Queries
The approach proposed in this paper was developed in the context of the project SUMMIT. As a technology transfer exercise, the project has a partner, the startup Datamedcare, that promotes the platform ADEL, which integrates the different actors that intervene in treating a condition called sleep apnoea.
Pneumologists treat the condition, and part of the treatment includes using devices called CPAPs intended to be used every day during people’s sleep. The device is technically calibrated according to patients’ physical and physiological characteristics. Doctors receive records that include the use of the device and other information from the patient and decide how to adjust the treatment if things are not going well.
The platform ADEL integrates the information and provides a global view of the protocols implemented to follow patients. The integration is loosely coupled, preventing doctors from automatically analysing conditions by correlating data. Besides, given the medical context, it is essential to ensure the respect of SLAs so that doctors, patients, and care providers can be confident about the quality of the data and the data services used to exchange them among these actors. The data provision strategy of ADEL does not technically consider this SLA notion.
3.1. Servitization of the ADEL platform
First of all, behind the scenes, we worked on a servitization phase of the current setting of the ADEL platform (see Figure 1). The objective of this phase was to ensure the independence of the actors and the type of data and infrastructure they use to produce and manage this data. Servitization was important because we included a QoS dimension, considering that the function of data providers and consumers alone is not enough to expose SLAs.
We defined five service groups hosted in secure private or public cloud providers (Azure, Google Cloud and AWS) to ensure different quality guarantees. The groups consist of services related to the following data: (1) Observations collected by CPAPs (the masks used by patients to treat their apnoea). (2) Questionnaires used by pneumologists to follow patients. (3) Health care providers’ data regarding patients’ subscriptions. (4) Pharmacies’ data about treatments for apnoea patients. (5) Data collected by the association of pneumologists studying apnoea.
In this context, analytics queries issued by actors can specify SLA contracts regarding, for example, the availability of the services managing data, their response time, and the quality of the data, such as freshness and frequency of upload.
In this setting, we assumed that analytics queries with their SLA specifications are rewritten into service compositions enacted to answer them [6]. Query rewriting from our point of view implies:
• Determining which data services are to be used.
• Figuring out to what extent potential services and compositions fulfil the SLA.
• Studying which deployment strategy is suited to enacting a target service composition.
Given the complexity and non-triviality of the apnoea case, the questions and requirements can be summarized in two categories:
Apnoea questions to answer. In the context of the apnoea application to which we transfer our SLA-guided intelligent data integration through queries, we have a set of baseline questions provided by apnoea specialists. These are essentially analytics questions that aim to
understand the conditions and evolution of the disease, particularly concerning the CPAP’s use.
Questions include an analysis of the way CPAPs are used. The idea is to determine the extent to which patients adapt to a given CPAP model and how this adaptation determines whether they will be assiduous. Assiduity can result in a positive evolution of their condition.
Other questions concern analytics for the companies, which need to know how well they perform when interacting with patients to get them to adopt their product.
Query Rewriting as a Service. Considering our objectives, we have proposed a rewriting service consisting of three main components (see the centre of Figure 1):
• Rhone, a services composition module guided by SLA requirements.
• An SLA observer that monitors the services continuously and evaluates QoS metrics.
• A registry, which is a services classifier that tags and ranks services with a quality index used by Rhone to choose the services to be composed given a query and its SLA specification.
3.2. Selecting trustworthy services
As aforementioned, we target a trustable integration; therefore, we describe the SLA aspects and how we integrated them into services compositions to answer questions. Recall that we need to choose the services that will provide data, ensuring SLA expectations for answering a question. For addressing SLA and given that we work for medical applications, we considered a definition of trust that includes service performance and data quality, knowing that there is little or no access to the services' meta-data; we call these black-box services.
In this context we addressed three questions:
• P1. What is the appropriate model for describing individual data services trust using service performance and data quality factors?
• P2. How to collect the necessary information for this trust evaluation model?
• P3. How to define data quality metrics using the collected information?
To this end, we propose a data quality observability protocol [8], defining data timeliness metrics for data services as a black box. The overall objective is to enable data consumers to select the most reliable data service according to their needs by providing a trust-sorted list of data services. Thus, the system is composed of three main modules:
- a performance measuring module, which collects and measures performance metrics;
- a data quality evaluation module, which implements our observability protocol;
- a data service trust measuring module, which collects both data quality and performance measurements and computes data services trust scores. This meta-data is not included in SLA models. In our future work, we will propose an extension of SLAs to include such measurements.
4. Use Cases
SUMMIT (http://summit.imag.fr) is a technology transfer project funded by the Auvergne-Rhône-Alpes region that addresses multi-clouds, intelligent data integration and service level agreements, and focuses on multi-device environments in the medical context.
We have the following services available, implemented during the servitization process of the ADEL platform (see Figure 1): (i) the healthcare providers' services with information about the patients’ insurances; (ii) a service dealing with information about patients’ medical procedures; (iii) CPAP data services managing the observations about apnoea episodes during sleep collected when patients use their devices; and (iv) computing services that provide analytics functions deployed in different cloud providers.
We implemented experiments as service-based queries that coordinate the ADEL platform’s data services with data processing operators to answer analytics questions regarding the apnoea condition. Experiments focus on (1) classifying patients according to their CPAP frequency of use (i.e., compliance). The objective is to observe the evolution of their physiological metrics as they use their CPAP. (2) Other experiments address the study of metrics regarding the apnoea condition and the use of CPAPs over time, seeking behaviour patterns. (3) Finally, experiments are devoted to predicting the evolution of patients’ condition according to the evolution of their physiological status and the use of their treatment.
4.1. Services based analytics queries
The first aspect to consider is translating questions expressed in natural language by topic experts into queries. We have used reference questions of the SUMMIT project partners and translated them into service coordinations that implement queries. In this project, we do not address the problem of processing natural language specifications. Still, we assume that we have a query language like the one proposed in our previous work [9, 10].
Classification queries. Q-1: Are there any key characteristics that differentiate patients whose compliance is less than 2 hours/night, between 2 and 4 hours/night and more than 4 hours/night on average?
Figure 2: Classification query. (Service coordination for Q-1: fetch from health-care providers, union, average frequency of use grouped by patient, filters on the average frequency per night, classify, assess.)
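As a rough illustration of the data preparation part of Q-1, the following Python sketch unions usage records from several providers, averages the hours of use per patient, and assigns each patient to one of the three compliance groups named in the question (less than 2, between 2 and 4, more than 4 hours/night). The record layout and the group labels are assumptions; in the paper each step is carried out by a service, and the sketch covers neither the search for differentiating characteristics nor the assessment of the classification.

```python
from collections import defaultdict
from statistics import mean


def fetch_usage(providers):
    """Union step: concatenate (patient_id, hours_of_use) records from all providers."""
    return [record for provider in providers for record in provider]


def average_use_by_patient(records):
    """Group the records by patient and average the nightly hours of use."""
    grouped = defaultdict(list)
    for patient_id, hours in records:
        grouped[patient_id].append(hours)
    return {patient_id: mean(hours) for patient_id, hours in grouped.items()}


def compliance_group(avg_hours):
    """Map an average nightly use to one of the three groups of Q-1 (labels are hypothetical)."""
    if avg_hours < 2:
        return "low"
    if avg_hours <= 4:
        return "medium"
    return "high"


def classify_patients(providers):
    """Run the whole chain: union, average by patient, then filter into groups."""
    averages = average_use_by_patient(fetch_usage(providers))
    return {patient_id: compliance_group(avg) for patient_id, avg in averages.items()}
```

The three group tests are independent of each other, so they could be evaluated in parallel by different computing services rather than in a single function as here.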
A general services composition schema to answer Q-1 can be the following (see Figure 2):
1. Fetch data from health care providers.
2. Union, average and group by patient the use of the device.
3. Filter by duration, then classify patients according to the average duration of use and assess the classification quality.
The whole data preparation and engineering step can be implemented by composing services or by using a ready-made composition. Similarly, the filtering process involves 3 filtering tasks that can be executed in parallel by one or several computing services.
Modelling and correlation queries. We expressed queries that included joining and filtering data provided by different services as data preparation tasks. The resulting data sets are then used as input to identify the role that several attributes play in the level of the apnoea condition.
• Q-2: Correlation between Epworth (see footnote 1) and the apnoea-hypopnea index (AHI) at diagnosis. Figure 3 shows the corresponding service coordination that implements this question. Data are fetched in parallel from two providers, managing the apnoea data collected from the CPAPs and data from the patients’ records. The coordination in the figure is generic, but there can be other possibilities. For example, having two parallel sequences of "fetch, filter and project" operations, one for each service provider, joining both results and finally applying a model that can estimate correlation.
Figure 3: Correlating Epworth and AHI for understanding the context of diagnosis. (Service coordination: fetch Epworth and AHI at diagnosis from the patient record API and fetch from the CPAP data manager, then join, project, filter and correlate.)
Footnote 1: The Epworth Sleepiness Scale (ESS) is a scale intended to measure daytime sleepiness by means of a very short questionnaire. This can be helpful in diagnosing sleep disorders.
Other questions that we implemented concern aggregation queries like Q-3, where physiological metrics collected by CPAPs are analysed to observe their evolution over time (see below). Filters can be applied to select data items for specific intervals. Data preparation can be done first to see the data distribution over time and include assiduity factors.
• Q-3: Evolution of pressures, leaks, AHI over time under treatment.
• Q-4: How does compliance change over time since initiation of treatment? Which factors influence compliance (including type of mask and brand of CPAP)?
The first part of Q-4 is implemented following the same principle as Q-3. Then other operations can be applied to address the second part and determine the factors that can influence compliance. To answer the second part of Q-4, a service of type health care provider must be used to fetch the history of the CPAP versions and brands that every patient has tested.
The first tasks of all the service coordinations expressing queries (see Figures 2 and 3) start by fetching and preparing data by selecting, filtering (e.g., items with(out) specific values) or projecting data and then applying operations like the union. These operations can be executed in parallel or sequentially, and their results become the input of the operators (tasks) that infer, discover, model or compute aggregations.
Prediction queries. On the basis of clinical data, including history and self-administered questionnaires (OSFP), is it possible to predict the severity of the condition (AHI below 15, between 15 and 30, or above 30)? Having data providers with labelled data collections can support prediction queries. The tasks can include analysing the properties of the attributes/variables of the data (e.g., linear
dependency of attributes with the target variable). It can also include tasks to engineer the data fetched from providers. Then it can consist of the application of prediction models and other assessment tasks.
4.2. Open Challenges: discussion and position
Having worked on the design and implementation of service coordinations that can implement the queries, the next step to address is the cloud or environment in which queries are executed. Besides, several services can be available for a given task or a given type of data. There is a decision-making problem to address: choosing the services that will execute the tasks of a query. Our project can deploy services in different clouds and offer various service guarantees, including trust guarantees. Thus, this concerns a rewriting process that can generate a solution space for a given query rather than one coordination.
In our current work, we have proposed a data quality observability protocol and its associated service TUTOR [8, 11]. The overall objective is to select the most reliable services according to given needs and provide a trust-sorted list of services.
The open challenge is to include the query rewriting process within an optimisation process that can use a cost model to rank or propose the coordinations that can potentially provide expected results in the best conditions possible. These conditions can include trust aspects associated with the services participating in a services coordination.
Another critical challenge is to reason about deploying a services coordination that implements a query. The coordination process can be executed in a distributed setting. Should the coordination run on different clouds or in only one cloud? Therefore, trust aspects can concern the cloud providers hosting the execution of a query. The decision making associated with the deployment is an open issue that we are currently addressing.
5. Conclusions and Future Work
This paper introduced open challenges and possible directions for integrating data for answering analytics queries on multi-cloud environments. The problems discussed are inspired by a concrete use case related to the analysis of medical data to understand, model and predict the condition of sleep apnoea. Our current work concerns the stabilisation, profiling, testing, and scaling of Rhone [6], an algorithm for SLA-guided data services composition. We are also consolidating services monitoring for computing services trust indices based on technical metrics and data quality metrics [8].
We observe the following main perspectives: (1) Evolving towards trust-based data services recommendation (multi-cloud). (2) Addressing more database challenges like capitalising on case observations to design deployment patterns for data services compositions. (3) Monitoring for collecting knowledge. (4) Proposing enactment and optimisation strategies.
6. Acknowledgement
This work is funded by the project SUMMIT, pack ambition program of the region Auvergne Rhône Alpes - P089 - 0718-184-ARA, https://summit.imag.fr.
References
[1] C. Ba, U. Costa, M. Halfeld-Ferrari, R. Ferre, M. A. Musicante, V. Peralta, S. Robert, Preference-driven refinement of service compositions, in: Proceedings of CLOSER 2014 International Conference on Cloud Computing and Services Science, 2014.
[2] M. Barhamgi, D. Benslimane, B. Medjahed, A query rewriting approach for web service composition, IEEE Transactions on Services Computing 3 (2010) 206–222.
[3] M. Lenzerini, Data integration: A theoretical perspective, in: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2002, pp. 233–246.
[4] U. S. Costa, M. H. Ferrari, M. A. Musicante, S. Robert, Automatic refinement of service compositions, in: International Conference on Web Engineering, Springer, 2013, pp. 400–407.
[5] R. Pottinger, A. Halevy, Minicon: A scalable algorithm for answering queries using views, The VLDB Journal 10 (2001) 182–198.
[6] D. A. Carvalho, P. A. S. Neto, C. Ghedira-Guegan, N. Bennani, G. Vargas-Solar, Rhone: A quality-based query rewriting algorithm for data integration, in: East European Conference on Advances in Databases and Information Systems, Springer, 2016, pp. 80–87.
[7] K. Benouaret, D. Benslimane, A. Hadjali, M. Barhamgi, Fudocs: A web service composition system based on fuzzy dominance for preference query answering, Proceedings of the VLDB Endowment 4 (2011) 1430–1433.
[8] S. Romdhani, G. Vargas-Solar, N. Bennani, C. G. Guegan, Qos-based trust evaluation for data services as a black box, in: C. K. Chang, E. Damiani, J. Fan, P. Ghodous, M. Maximilien, Z. Wang, R. Ward, J. Zhang (Eds.), 2021 IEEE International Conference on Web Services, ICWS 2021, Chicago, IL, USA, September 5-10, 2021, IEEE, 2021, pp. 476–481. URL: https://doi.org/10.1109/ICWS53863.2021.00067. doi:10.1109/ICWS53863.2021.00067.
[9] V. Cuevas-Vicenttín, G. Vargas-Solar, C. Collet, N. Ibrahim, C. Bobineau, Coordinating services for accessing and processing data in dynamic environments, in: OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", Springer, 2010, pp. 309–325.
[10] V. Cuevas-Vicenttín, G. Vargas-Solar, C. Collet, Evaluating hybrid queries through service coordination in hypatia, in: Proceedings of the 15th International Conference on Extending Database Technology, 2012, pp. 602–605.
[11] S. Romdhani, N. Bennani, C. G. Guegan, G. Vargas-Solar, Trusted data integration in service environments: A systematic mapping, in: S. Yangui, I. B. Rodriguez, K. Drira, Z. Tari (Eds.), Service-Oriented Computing - 17th International Conference, ICSOC 2019, Toulouse, France, October 28-31, 2019, Proceedings, volume 11895 of Lecture Notes in Computer Science, Springer, 2019, pp. 237–242. URL: https://doi.org/10.1007/978-3-030-33702-5_18. doi:10.1007/978-3-030-33702-5_18.