Towards a model-based collaborative framework
     for calibrating database cost models

         Abdelkader Ouared1,2 , Yassine Ouhammou3 , and Ladjel Bellatreche3
                      1
Ibn Khaldoun University of Tiaret, Algeria
         2
             National High School for Computer Science (ESI), Algiers, Algeria
                       3
                         LIAS/ISAE-ENSMA, Futuroscope, France
a_ouared@esi.dz
                      {yassine.ouhammou, bellatreche}@ensma.fr



       Abstract. The optimization of database systems is often driven by
       cost models during the physical-design phase. A large panoply of
       database cost models has been proposed, where each one depends on
       various kinds of system parameters (platform, software system, database
       schema, etc.) and can be related to various resources (e.g., processors,
       memories, networks). Currently, the development of cost models is still
       time-consuming, owing to the time required to calibrate and tune pa-
       rameters in order to obtain cost models close to the practical usage of
       database systems. Moreover, a cost model can concern many resources
       simultaneously, hence its development may need to be done collabora-
       tively between actors who complement each other, since each of them
       can be an expert in a particular resource domain.
       In this paper, we propose a framework which helps cost model designers
       to work collaboratively and enables them to automate and ease the tra-
       ditional workflow dedicated to calibrating cost model parameters through
       open-source DBMSs (database management systems). This framework is
       compliant with PostgreSQL and is implemented as a model-based in-
       frastructure thanks to model-driven engineering settings.


1     Introduction
Evaluating the performance of database systems has become more and more crucial
in the era of big data. It helps to check whether the non-functional requirements of a
database system are met. This evaluation can be performed in two ways. The first
one is the application of real experiments on the system under design. The second
one consists in simulating experiments by using formal mathematical models enabling
the quantification of metrics (e.g., response time, energy, throughput). These models
are called cost models. In this paper, we focus on the workflow of cost model
development.


1.1    Context and work positioning
Nowadays, the development of database systems deals more and more with non-
functional requirements (like response delays and energy consumption of queries), in
addition to functional requirements. In order to obtain database systems that meet these




kinds of requirements, several physical optimization structures (e.g., indexes, partitions,
materialized views [2]) can be used, depending on the DBMS (database manage-
ment system) selected for the database system and also on the database
architecture [1]. However, the selection of the appropriate physical optimization struc-
tures is NP-hard, since it is considered as a decision problem [2]. Thus, to reduce this
complexity, several research efforts have adapted optimization algorithms [7] [2] (e.g.,
greedy algorithms, evolutionary algorithms, linear programming algorithms) in order
to find optimal solutions (i.e., physical optimization structures). These solutions cannot
be approved by database system designers without using appropriate cost models that
quantify the benefit of the selected physical structures.
    Each cost model integrates as much information as possible in order to provide a
precise measure. This information is usually related to the system configuration: the
database (e.g., size of tables, number of tables), the workload (e.g., number of joins,
selectivity factors of predicates), the deployment platform (e.g., centralized, parallel,
cloud) and the storage layout (e.g., column store, row store). Since the 1980s, many
cost models have been proposed; each one is dedicated to a specific database system
configuration and allows the computation of a specific metric (e.g., response time,
energy, throughput). Besides, technology evolution motivates researchers to adapt and
revisit existing cost models by integrating the new configuration parameters brought
by that evolution. An effective way for researchers to obtain such adapted cost models
is to back the adaptation with an open-source DBMS, such as PostgreSQL4 . Indeed,
cost model developers adapt the DBMS source code of existing cost models in order
to calibrate parameters well, hence obtaining new refined cost models. Figure 1 sketches
the workflow of cost model construction based on the usage of an open-source DBMS.




              Fig. 1. Conceptual work-flow of the cost model development

    First, the cost model developer (a.k.a. modeller or designer) defines a blueprint
version of a given cost model (Step 1 of Figure 1). Then, he/she implements the cost
model as program code based on the DBMS code (Step 2 of Figure 1). Hence, this new
implementation can be launched in order to obtain measures of the desired metrics for
a given workload (Step 3 of Figure 1). Based on these measures, the developer
can calibrate the former cost model and then define a new refined one (Step 4 of
Figure 1). In this step, cost model designers can be assisted by various techniques,
such as machine learning, to tune parameters and also to deduce their inter-
dependencies. All these steps can be repeated many times until a cost model that
matches the developer's needs is obtained.
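    To make Step 4 more concrete, the following minimal C sketch (hypothetical, not
taken from PostgreSQL; all names are ours) illustrates one simple calibration technique:
fitting a per-page cost unit by least squares from the measured execution times collected
in Step 3.

#include <stdio.h>

/* Hypothetical Step-4 calibration sketch: fit a per-page cost unit c so
 * that predicted_cost = c * pages best matches the measured times, via a
 * one-parameter least-squares fit (c = sum(p*t) / sum(p*p)). */
static double calibrate_page_cost(const double pages[],
                                  const double measured_ms[], int n)
{
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; i++) {
        num += pages[i] * measured_ms[i];
        den += pages[i] * pages[i];
    }
    return den > 0.0 ? num / den : 0.0;
}

int main(void)
{
    /* Measures obtained from Step 3 (illustrative values only). */
    double pages[]   = { 100.0, 1000.0, 10000.0 };
    double time_ms[] = {   2.1,   19.8,   203.5 };
    printf("calibrated cost: %.4f ms/page\n",
           calibrate_page_cost(pages, time_ms, 3));
    return 0;
}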



4
    PostgreSQL is a relational database system with a fairly traditional architecture.
1.2   Problem statement

As shown in the process illustrated in Figure 1, the development, calibration and
adaptation of cost models can be laborious and time-consuming, since the process is not
automatic (i.e., it is done manually) and needs to be repeated many times. More-
over, every iteration of this classical process can be error-prone, due to the complexity of
cost model parameters (algorithms, platform characteristics, database features), which
have a significant impact on the quality and the correctness of the cost models [14].
    This manner of development also penalizes the collaboration between designers of
the same cost model when its development requires different domain expertise. On the
one hand, the collaboration is hard to carry out synchronously: merging, conflict
management and versioning actions have to be done manually, which can be error-prone.
On the other hand, there is no standard language for expressing a cost model as an
abstract description in a machine-processable format.


1.3   Paper contributions

A large body of work has explored building and adapting database cost models.
However, to the best of our knowledge, no work provides an infrastructure helping
designers to define and adapt cost models collaboratively.
     To fill this gap, in this paper we propose a framework that allows cost model de-
signers to define cost models and to automatically generate the appropriate programs
based on the source code of PostgreSQL. The framework offers three main services:
(i) the description of cost models, (ii) multi-user support to ensure synchronous
collaboration, and (iii) automatic code generation. Our proposition is based on
model-driven engineering settings. Hence, we benefit from our recent work called
CostDL (cost model description language [19]), which is a description language
dedicated to expressing cost models as formal abstractions. Also, we use a model
repository called CDO (Connected Data Objects) [10] to ease the management of
conflicts related to collaboration and versioning. The automatic code generation is
based on a transformation process which transforms cost models into code that
corresponds to an open-source DBMS, namely PostgreSQL in this paper.


1.4   Paper outline

The rest of the paper is structured as follows. Section 2 presents the background. In
Section 3 we present our proposed approach. Section 4 highlights the implementation
of our proposition and shows its applicability and benefits. Section 5 discusses the
related work. Finally, Section 6 summarizes and concludes this paper.


2     Background

This section presents the scientific baseline of this paper. First, we give a brief de-
scription of advances in cost models and set out the challenges faced by cost model
designers through a discussion. Then, we introduce the benefits of the model-driven
engineering paradigm. Moreover, a description of our recent work [19] is provided, and
its interest is motivated and illustrated by an example.
2.1   What is a cost model?
Since the 1980s, a lot of cost models (CMs) have been proposed (e.g., [22, 16, 23, 4]).
We summarize some of their principles hereafter.
    Typically, database end-users manipulate different kinds of operations (such as
join, scan, sort, etc.), where each operation's execution costs in terms of response time,
memory size and/or energy. The cost of an operation is related to the computational
complexity, the platform characteristics and various database features. Thus, each CM
is dedicated to a particular usage context characterized by parameters related to the
architecture layers (for instance: operating system layer, data system layer, access
method layer, buffer manager layer and storage system layer) [5]. Moreover, a CM is
also characterized by a mathematical formula, which depends on parameters present
in the context (such as database size, indexes, disk layout). This mathematical formula
may be derived from a set of other basic mathematical formulas, where each one
represents a logical or a physical cost. While the former is related to properties of data
processing regardless of the deployment layout, the latter quantifies the impact of
hardware parameters such as memory size and block size [16].
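    As an illustration, PostgreSQL's cost function for a sequential scan over a relation
R has roughly the following shape (omitting predicate-evaluation costs):

        Cost(SeqScan(R)) ≈ ||R||_p · c_s + ||R|| · c_t

where ||R||_p (the number of disk pages of R) and ||R|| (its number of tuples) are
logical quantities, while the cost units c_s (cost of a sequentially fetched page,
seq_page_cost) and c_t (CPU cost per tuple, cpu_tuple_cost) are physical parameters
of the platform.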


2.2   Discussion and motivations
The underlying idea behind this paper was inspired by our experience in developing
cost models. In our laboratory, we regularly develop cost models, the majority of
which are developed from scratch. For instance, we recently proposed cost models in
[20] [21] by using the query optimizer of PostgreSQL. The development of such a cost
model lasts several weeks, for various reasons. (1) We had to study the C programs
of PostgreSQL before manually coding the cost model. (2) For refinement reasons,
we repeated this coding process several times until achieving good execution
performance and good result quality. (3) Since the design of the cost model in ques-
tion concerned different aspects (e.g., database and hardware), we worked together in
order to help each other and to combine our contributions. However, because of the
hard-coded approach that we followed to develop our cost models, we were obliged to
progress sequentially and alternately instead of progressing in parallel and simulta-
neously. (4) Besides, even if code versioning solutions like Git [6] and SVN [8]
ease conflict management in the case of multiple developers, their utilization requires
being familiar with the code produced by the other collaborators, which is not intuitive.
Unfortunately, these enumerated reasons are the common difficulties and challenges that
cost model designers face. Moreover, they also penalize the reuse of these efforts. For
instance, if one decides to replace PostgreSQL by another open-source DBMS, one has
to re-code the program according to the programming language and the source code
of this DBMS.


2.3   Model-driven engineering
MDE (Model-Driven Engineering) [13] is a paradigm that focuses systems and software
development on the utilization of models as a cornerstone. Hence, models are
considered as the central entities of a generative process. MDE helps to overcome the
growing complexity of systems. In other words, MDE offers the possibility to describe
a system with the help of a series of models. More abstract models are refined into
more concrete models by model transformations that ensure that the information from
higher-level models is retained in lower-level models [11].
2.4   Existing work: CostDL in a nutshell

In order to define CMs in a structured way and also to be able to compare them,
every CM must be a software entity. Consequently, we have used MDE settings and
have proposed CostDL, a cost model description language [19]. CostDL is a
meta-model (see Figure 2) that allows CM designers to define their models based on
several parameters related to the database, the queries and the platform. CostDL includes
MathML [3] in order to help designers to define formally all the mathematical functions
of a CM. A tool has been developed that allows designers to express CMs, where each
one is an instance conforming to the CostDL meta-model.




                     Fig. 2. Meta-modeling and modeling levels




                Fig. 3. Excerpt of CostDL meta-model: core entities



Example In the following, we present a cost model CMexample that has been proposed
by D. Bausch et al. [4]. Let Cost_IO(hashjoin) be the cost formula of CMexample; it
measures the I/O response time of hash-join operations, where:

    Cost_IO(hashjoin) = ||R_i||_p · c_rw + ||R_i||_p · c_sr + ||R_o||_p · (c_rw + c_sr)    (1)

    Figure 4 shows the expression of CMexample in the CostDL language. The instance,
which conforms to the CostDL meta-model, contains the different parameters related to
the CMexample context. Figure 4 also shows that the cost function is composed of a
set of logical and physical costs which are inputs of the mathematical formula.




          Fig. 4. Excerpt of the cost model CMexample expressed in CostDL
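    As a sketch of what Step 2 of Figure 1 could produce for this example, the following
stand-alone C function evaluates formula (1) directly; it is illustrative only, and its
parameter names follow the formula rather than actual PostgreSQL identifiers.

/* Hypothetical, stand-alone evaluation of formula (1):
 * Cost_IO(hashjoin) = ||Ri||_p*c_rw + ||Ri||_p*c_sr + ||Ro||_p*(c_rw + c_sr)
 * where pages_inner = ||Ri||_p, pages_outer = ||Ro||_p, and c_rw and c_sr
 * are the device-specific cost units appearing in formula (1). */
static double cost_io_hashjoin(double pages_inner, double pages_outer,
                               double c_rw, double c_sr)
{
    return pages_inner * c_rw
         + pages_inner * c_sr
         + pages_outer * (c_rw + c_sr);
}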

    This section has summarized the background needed to ease the understanding of
the contributions and to support them. The contributions of this paper aim to
capitalize on the efforts put into designing cost models. First, we aim to automate the
process by using MDE techniques, to take benefit from model abstraction, code
generation and reuse capabilities. Second, we aim to ease the management
of collaboration conflicts by raising them from the code level up to the modeling level.


3     The Conceptual Framework for calibrating CMs
This section is devoted to presenting the contribution materials. First of all, we present
an overview of the framework, called CF-CM (Conceptual Framework for Calibrating
cost Models). Then we present our approach to examine and synthesize the concepts
existing in PostgreSQL programs. In addition, we expose the mapping relations that
bridge the gap between cost models expressed in the CostDL language and their
C code in PostgreSQL. Finally, we explore the different solutions that can help us
manage the collaboration before opting for the appropriate one.


3.1   CF-CM Overview
Since the paper contributions have led to the development of the framework called
CF-CM, we first give an overview of the capabilities it provides. CF-CM offers
the following services. (i) Service 1: edition of cost models expressed in a machine-
processable formalism. (ii) Service 2: generation of programs from any cost model
expressed via Service 1; the generated programs have to be compliant with an
open-source DBMS. (iii) Service 3: in the case of a collaboration, this service helps
users to combine their modifications into a global cost model.




                    Fig. 5. Overview of the CF-CM Infrastructure

     Figure 5 shows all these services and how they communicate with each other.
Service 1 is based on the CostDL language [19]. This service allows expressing the
cost models as instances of the CostDL meta-model [19], which helps to reuse, compare
and share cost models under the same formalism by downloading and uploading
them. Service 1 is dedicated to be used by a single designer, but there can be as many
instances of Service 1 as there are collaborating designers. In this case, Service 3 is
required. That is, every designer can download the last version of the cost model,
modify it and commit the updated version at run-time. Service 3 guarantees that all
cost model instances edited by the distributed collaborators are impacted. The advantage
of this service is that conflicts are managed at the model level instead of by comparing
code at the program level. The benefit is that during the modeling process we can
assume that all collaborators master and understand the CostDL language, since it
contains a limited set of concepts which are common to all designers. Another important
point is that, contrary to programs, where modifications may impact several code pieces
scattered over different files, modifications made through our framework impact only
one file. Once a draft version of a cost model is obtained, designers can transform it
into programs. This transformation is provided automatically by Service 2. It is based
on two meta-models. The first one is the CostDL meta-model, which represents the
generation source. The second one is the meta-model of the PostgreSQL API
(Application Programming Interface), which is the generation target (see Section 3.2).
The transformation does not only generate programs; it also integrates them by
modifying the appropriate PostgreSQL files according to the input cost model. While
Service 1 is based on our recent existing work, in the following we focus on the
cornerstone elements of Service 2 and Service 3.
3.2   CostDL to PostgreSQL transformation
In this section we present the CostDL to PostgreSQL transformation, which is the
second step of the approach presented in Figure 5. We first introduce the PostgreSQL
API metamodel and then focus on the transformation itself.


PostgreSQL API In order to transform any cost model conforming to the CostDL
language into PostgreSQL C programs, we first studied the structure of the PostgreSQL
API (Application Programming Interface). The underlying idea behind this study is
to achieve two objectives. The first one consists in making the transformation as au-
tomatic as possible by targeting key API elements. The second objective is to have
a flexible transformation that can be easily extended. Therefore, the skeleton of the
PostgreSQL API has been summarized as shown in the model illustrated in Figure 6.




                Fig. 6. Excerpt of PostgreSQL API conceptual model

    Figure 6 shows the conceptual model of the API skeleton, which is composed of a
set of classes to manipulate the PostgreSQL DBMS. In this metamodel, the global
mathematical formula may be derived from a set of other basic mathematical formulas,
where each one represents a logical, physical or algorithmic cost. The cost model
computes a performance metric of a given query by assembling these costs [16]:
(i) Logical costs estimate the data volumes of the database operations; we distinguish
three data volumes: input (per operand), output, and temporary data or data streams.
Data volumes are usually measured as cardinalities, i.e., numbers of tuples.
(ii) Algorithmic costs extend logical costs by also taking into account the computational
complexity (expressed in terms of O-classes) of the algorithms (nested-loop join,
hash join, ...). (iii) Physical costs finally combine algorithmic costs with system/hardware-
specific parameters (CPU, I/O and network) to predict the total costs in terms of
execution time.
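    A minimal C sketch of this three-layer assembly (illustrative only, not actual
PostgreSQL code; all names are ours):

/* (i) Logical cost: estimated output cardinality of a join. */
static double logical_join_card(double card_left, double card_right,
                                double selectivity)
{
    return card_left * card_right * selectivity;
}

/* (ii) Algorithmic cost: number of elementary operations; a hash join
 * is roughly linear in both inputs, i.e. O(n + m). */
static double algo_hashjoin_ops(double card_left, double card_right)
{
    return card_left + card_right;
}

/* (iii) Physical cost: convert operation counts and page accesses into
 * time using hardware-specific cost units. */
static double physical_cost(double ops, double pages,
                            double cpu_op_cost, double page_cost)
{
    return ops * cpu_op_cost + pages * page_cost;
}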


The CostDL Language Each user uses common parameters to describe his/her
cost models. In our scenario, the initial cost model instance conforms to the CostDL
language, which is dedicated to describing database cost models [19]. A cost model is
characterized by four elements: its cost types, its context, its cost function and its
performance metric kind.
     Formally, a cost model is defined as a tuple CM_i = <CT_i, cxt_i, func_i, m_i>.
CM_i calculates the value of the cost model metric m_i ∈ {Response time, Energy,
Size}. The cost type CT_i represents the database components considered by the cost
model, where CT_i ⊆ {CPU, I/O, Memory, Network}.
     A context is a set of parameters cxt_i = {p_1, p_2, ..., p_n}, where each
p_i ∈ P = P_Database ∪ P_Hardware ∪ P_Architecture ∪ P_Query and every P_j is a
set of parameters belonging to one of the four categories. Moreover, P_Database ×
P_Hardware × P_Architecture × P_Query = {cxt_1, ..., cxt_m} = CXT is the set of all
possible contexts of the cost model universe.
     The cost function func_i of a given cost model CM_i computes the value of the
cost model metric m_i based on a subset of the context parameters cxt_i. The cost
function is a tuple func_i = <Param_i, mf_i>, where Param_i ⊆ cxt_i (a subset of the
cost model context) and mf_i is the mathematical formula, with mf_i : P^k → R+ and
k ≤ cardinal(Param_i).
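    To make the tuple concrete, a hypothetical C rendering of CM_i = <CT_i, cxt_i,
func_i, m_i> could look as follows (the names are ours, not CostDL's):

/* Hypothetical C rendering of the tuple CM = <CT, cxt, func, m>. */
typedef enum { METRIC_RESPONSE_TIME, METRIC_ENERGY, METRIC_SIZE } Metric;
typedef enum { CT_CPU = 1, CT_IO = 2, CT_MEMORY = 4, CT_NETWORK = 8 } CostType;

typedef struct {
    const char *name;     /* parameter name, e.g. "block_size" */
    double      value;    /* calibrated value */
    int         category; /* Database, Hardware, Architecture or Query */
} Parameter;

typedef struct {
    unsigned   cost_types; /* CT: bit-set over {CPU, I/O, Memory, Network} */
    Parameter *context;    /* cxt: the parameters p_1 .. p_n */
    int        n_params;
    /* func: mathematical formula mf : P^k -> R+, defined over a subset
     * of the context parameters */
    double   (*cost_function)(const Parameter *params, int k);
    Metric     metric;     /* m: the performance metric being computed */
} CostModel;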

CF-CM Transformation The CF-CM process is a code-artifact generation that
creates a corresponding C class via the Blueprints API (see Fig. 6). By using
the Blueprints API, the CF-CM generator creates a cost model that contains the
parameters and formulas. We provide the list of mapping rules necessary to derive a
cost model from CostDL concepts (see Table 1).
                 Table 1. CostDL to a Cost Model of PostgreSQL

| CostDL Concepts | PostgreSQL Concepts | Comments |
| Algebra operation Op | relational algebra operations | one-to-one correspondence |
| Physical parameters: p_i ∈ (P_Hardware ∪ P_Arch.) | cost units: c_s, c_r, c_t, c_i, and c_o | expressing physical parameters of the machine |
| Logical parameters: p_i ∈ (P_Database ∪ P_Query) | selectivity parameters: n_s, n_r, n_t, n_i, and n_o | statistical values of the database (e.g. tuples, width) |
| LogicalCost | Selectivity clause_selectivity(), void set_joinrel_size_estimates(), double relation_byte_size() | assign_{op,alg}(card, f_{card,op}): assigns cardinality functions to basic operations |
| PhysicalCost | void initial_cost_operation(), IO-Cost(), CPU-Cost() | assign_{op,alg}^{IO,CPU}(Cost, f_{IO,op}): methods assigning CPU/IO costs to each basic operation, depending also on the implementation algorithm used |
| =, >, >=, <, <=, <> | =, >, >=, <, <=, != | one-to-one correspondence |
| +, -, /, %, * | +, -, /, %, * | one-to-one correspondence |
| and, or, not | &&, ||, ! | one-to-one correspondence |
| variable | variable | one-to-one correspondence |
| literals | literals | one-to-one correspondence |
    These mapping rules have been operationalized in a transformation implemented
with the Acceleo plug-in available within the Eclipse environment. In this work, the
cost model instance is translated into a C program using a model-to-text transformation
process. Thanks to the interoperability facilitated by MDE, we can imagine a variety
of further usages of cost models. We assume that users can develop user-friendly tools
to exploit shared cost models. For instance, in our laboratory we have developed a
third-party tool that transforms XMI files corresponding to cost models into C programs
based on the PostgreSQL API (see Fig. 6).
    Listing 1.2 shows the generated code for the example presented in Fig. 4 (see
Section 2.4). At the end of the design, one can check the conformity of the cost model.
For this, a set of structural rules has been injected into the meta-model. These rules are
expressed as OCL (Object Constraint Language) [18] invariants. Listing 1.1 is an
example of a structural rule. It means that all physical costs and logical costs, which are
inputs of a given cost function, have to be referenced as MiType instances in the
MathType instance of that cost function (see [19] for more details).
context CostFunction
inv: self.global_mathematical_formula.mi->includesAll(self.logicalcost)
     and self.global_mathematical_formula.mi->includesAll(self.physicalCost)

                                 Listing 1.1. An OCL structural rule

    This generator tool has been tested in [20]5 . That is, by using the Acceleo language
we have developed a code generator which produces C programs based on the PostgreSQL
API. The objective is to generate the C code of every cost model conforming to CostDL
[19]. As an example, the CMexample of the running example (Section 2.4) is
translated into functions, implemented in C, that measure the cost of the hash-join
operation. The following listing represents an excerpt of the generated result.
     ...
void cost_hashjoin(Path *path, PlannerInfo *root,
                   RelOptInfo *baserel, ParamPathInfo *param_info)
{
    Cost     startup_cost = 0;
    Cost     run_cost = 0;
    double   spc_seq_page_cost;
    QualCost qpqual_cost;
    Cost     cpu_per_tuple;

    Assert(baserel->relid > 0);
    Assert(baserel->rtekind == RTE_RELATION);
     ...
    /* Use the parameterized row estimate when available. */
    if (param_info)
        path->rows = param_info->ppi_rows;
    else
        path->rows = baserel->rows;
    if (!enable_seqscan)
        startup_cost += disable_cost;
       ...

                           Listing 1.2. An excerpt of the generated C code



4     Proof of Concept and Tooling

To stress our approach and to prove how useful and helpful it is, this section is devoted
to presenting a global usage scenario of the CF-CM framework. In parallel, the technical
implementation is highlighted. The usage scenario is organized as shown in
Figure 7.
5
    http://www.lias-lab.fr/forge/projects/ecoprod
                          Fig. 7. Technical implementation

4.1   Tool Support of CF-CM

Our approach has been implemented and made available as a web system based on
Eclipse tooling support. Eclipse projects have become a standard because they support
the main technologies from the OMG. Eclipse tooling comes with EMF models that
support DSL tools and collaborative work on distributed model repositories like
EMFStore6 and CDO7 . It supports several DBMSs, including relational and NoSQL
databases. Our implementation is based on the open-source Connected Data Objects
(CDO) project. This technology provides support for establishing remote repositories
and the functionality to work with them while ensuring the needed consistency and
security. CDO is a semi-automated persistence framework that works natively with
Ecore models and their instances.
    We implemented the common functionality by adapting the CDO tooling for
storing/sharing cost models, live collaboration with CDO, and versioning support and
comparison. The four pillars of the proposed infrastructure for CF-CM are: (i) Cost
Model Editor Service, (ii) Sharing/Collaboration, (iii) Code Generation and
(iv) Repository Management. To ease the understanding, we provide the URL of a
demonstration video of the project 8 .


Cost Model Editor Service The cost model editor is composed of several user
interfaces. It manages cost models in the repository; thus it permits users to upload,
download and delete cost models, and even search for cost models conforming to a
specific metamodel. The cost model editor service permits users to install cost models
locally by creating a local clone of a repository (at this stage of the project we do not
refer to the online mode). This local repository is synchronized with the remote
repository as soon as the connection is restored. The Cost Model Editor Service
provides two major services: seeking cost models and sharing cost models. There are
two possible ways to create a cost model expressed in CostDL: via an offline editor
or via an online editor. The offline
6
  http://www.eclipse.org/emfstore
7
  http://www.eclipse.org/cdo
8
  The demonstration video of our tool is available at: https://youtu.be/rHIVvJEOrbM
editor is a tool based on the Java EMF (Eclipse Modeling Framework) API and has
been integrated as a plug-in in Eclipse 9 , an integrated development environment
(IDE). Through the editor tool, every cost model instance is saved as an XMI (XML
Metadata Interchange [9]) file.


Collaboration and Sharing CDO is a huge ecosystem on top of EMF. Another
advantage of model repositories is the possibility to work collaboratively on cost models.
A distinction can be made between offline and online collaboration (on the Web). Offline
collaboration can be regarded as the classical process of source code versioning: a
cost model can be shared and checked out afterwards; then changes can be made and
committed. The usage scenario is organized as shown in Figure 7. The cost model
is fragmented into smaller components to ease the development, because every developer
can specialize in a specific component (CPU, I/O, network, join, sort, etc.)
and perform different actions on the cost model (calibration, extension, adaptation).
Consequently, users have the ability to load only the parts of the cost model that they
require.


Repository Administration CDO provides an API 10 to manage repositories
remotely. We exploit the API provided by CDO and integrate these functions into a
graphical user interface to ease repository management. In order to manage the security
system, an entity is needed which creates user accounts, changes passwords and sets
user rights. The security manager needs a special account to log into the system.
In addition, to manage the repositories, this role has to be in charge of creating and
deleting repositories. Regarding cost model versioning, old revisions of cost models can
be restored. CDO implements the Audit View, which offers a way to obtain versioning:
with this CDO view it is possible to see previous versions of the cost models in the
repository and to use a previous version if necessary. Based on CDO revisions, CDO
supports the concept of optimistic versioning. This means that a developer first changes
the state of the cost model locally before resynchronizing the changes with the server,
which keeps a history of the different states of the cost models. Note that it is possible
to save or export an old state of the model as an XML file and reload the model if
necessary.


5      Related Work
Cost models in database systems have become an active research topic, and there has
been a plethora of studies and initiatives by the research community. Reviewing the
literature shows that CM consumers are usually researchers, industrial practitioners
and students. Usually, CMs are designed and developed from scratch; we call this
design approach the hard-coded approach. Therefore, CMs must be newly designed,
implemented and tested, which leads to duplicated implementation efforts and thus in-
creased development costs; the possibility of errors is also rather high. This scenario,
in which a developer or system designer creates a CM from scratch, can be pushed
by (i) advances in database technology (e.g., column stores, parallel execution),
(ii) progress in hardware technology (e.g., new CPUs or GPUs),
(iii) or the emergence of new non-functional requirements (e.g., energy, system sizing).
9
     Eclipse Modeling Project. www.eclipse.org/modeling/
10
     org.eclipse.emf.cdo.common.admin
Thus, this is the most challenging and resource-intensive scenario: the developer or
designer has to revisit all the layers impacting the system and to extract the relevant
parameters with their respective basic units to build accurate CMs. Furthermore, the
development process of a CM requires deep knowledge related to many aspects:
databases, hardware, machine learning for calibrating its parameters, statistics to
estimate parameters such as selectivity factors of join predicates, etc. For example,
we can cite some advanced studies and tools developed on top of PostgreSQL: the
Parinda tool [15], which offers designers an interactive physical designer for PostgreSQL;
PET, an energy-aware query optimization framework [24]; the work of [14], which
challenges certain theories regarding query optimizers; OntoDBench, for evaluating
semantic databases [12]; and SQOWL, a tool performing type inference over data stored
in PostgreSQL [17]. Certainly, we cannot deny the strong usage of PostgreSQL by the
research and teaching communities, but its CMs are ad-hoc and hard-coded (C programs).
Their utilization is therefore tedious and time-consuming, since one must find all the
formulas. Moreover, as the CMs proposed by PostgreSQL run only on the PostgreSQL
architecture, they cannot be used elsewhere (e.g., for NoSQL architectures).
    A large body of work has explored adapting database CMs, but little attention has
been paid by the database community to automating the process of CM building. Despite
the numerous existing CMs, only a few can be used and re-used easily. Indeed, the
use of these CMs for performance analysis is still expensive, since it requires good
design expertise and up-to-date knowledge. Therefore, we propose to automate the
process via a workflow system using the MDE paradigm, to take benefit from model
abstraction, code generation and reuse capabilities.


6    Conclusion

This article addressed the problem of collaborative development of database cost mod-
els. We have presented the CF-CM framework to automate the process of cost model
development. It covers the fundamental features required by an infrastructure for cal-
ibrating such cost models, summarized in the four pillars of cost model editor, code
generator service, sharing/collaboration and repository administration. Our solution
shows that it is possible to work online and collaboratively on cost model entities based
on CDO. A prototype implementation demonstrates the feasibility and practical utility
of the approach.
    Currently, PhD students are testing our tool in order to give us their feedback
for possible improvements. Another ongoing work pursues adapting our framework
to cover multiple database system types, more precisely NoSQL. It also opens up
opportunities, for instance in online teaching (MOOCs, etc.): students following our
Advanced Databases course can train on building cost models, which will allow us to
evaluate the usability and usefulness of our solution.


References

 1. D. J. Abadi, D. S. Myers, D. J. DeWitt, and S. R. Madden. Materialization
    strategies in a column-oriented DBMS. In 2007 IEEE 23rd International Conference
    on Data Engineering, pages 466–475. IEEE, 2007.
 2. S. Agrawal, S. Chaudhuri, and V. Narasayya. Materialized view and index selection
    tool for Microsoft SQL Server 2000. ACM SIGMOD Record, 30(2):608, 2001.
 3. A. Asperti, L. Padovani, C. S. Coen, F. Guidi, and I. Schena. Mathematical
    knowledge management in HELM. Ann. Math. Artif. Intell., 38(1-3):27–46, 2003.
 4. D. Bausch, I. Petrov, and A. Buchmann. Making cost-based query optimization
    asymmetry-aware. In DaMoN, pages 24–32. ACM, 2012.
 5. T. Burns et al. Reference model for DBMS standardization. SIGMOD Record,
    15(1):19–58, 1986.
 6. S. Chacon and J. Hamano. Pro Git, vol. 288. Apress, Berkeley, 2009.
 7. S. Chaudhuri and V. Narasayya. Self-tuning database systems: a decade of
    progress. In Proceedings of the 33rd International Conference on Very Large Data
    Bases, pages 3–14. VLDB Endowment, 2007.
 8. M. Fischer et al. Populating a release history database from version control
    and bug tracking systems. In Software Maintenance, 2003. ICSM 2003. Proceed-
    ings. International Conference on, pages 23–32. IEEE, 2003.
 9. O. M. Group. OMG MOF 2 XMI mapping specification. Version 2.4.1, http://www.
    omg.org/spec/XMI/2.4.1/, 2011 (accessed 06.03.17).
10. Eclipse Foundation. The Connected Data Objects model repository (CDO) project.
    http://eclipse.org/cdo, 2012 (accessed 06.03.17).
11. C. He and G. Mussbacher. Model-driven engineering and elicitation techniques: A
    systematic literature review. In 24th IEEE International Requirements Engineering
    Conference, RE 2016, Beijing, China, September 12-16, 2016, pages 180–189, 2016.
12. S. Jean, L. Bellatreche, et al. OntoDBench: Interactively benchmarking ontology
    storage in a database. In ER, pages 499–503, 2013.
13. S. Kent. Model driven language engineering. Electr. Notes Theor. Comput. Sci.,
    72(4):6, 2003.
14. V. Leis et al. How good are query optimizers, really? PVLDB, 9(3):204–215,
    2015.
15. C. Maier, D. Dash, I. Alagiannis, A. Ailamaki, and T. Heinis. Parinda: an inter-
    active physical designer for PostgreSQL. In EDBT, pages 701–704. ACM, 2010.
16. S. Manegold, P. Boncz, and M. L. Kersten. Generic database cost models for
    hierarchical memory systems. In VLDB, pages 191–202, 2002.
17. P. McBrien, N. Rizopoulos, and A. C. Smith. SQOWL: type inference in an
    RDBMS. In ER, pages 362–376, 2010.
18. OMG. Object Constraint Language. OMG available specification. Version 2.0,
    www.omg.org/spec/OCL/2.0/, 2006 (accessed 06.04.16).
19. A. Ouared, Y. Ouhammou, and L. Bellatreche. CostDL: a cost model description
    language for performance metrics in databases. In Proceedings of the 21st IEEE
    ICECCS. IEEE, 2016.
20. A. Roukh, L. Bellatreche, A. Boukorca, and S. Bouarar. Eco-DMW: Eco-design
    methodology for data warehouses. In ACM DOLAP, pages 1–10. ACM, 2015.
21. A. Roukh, L. Bellatreche, and C. Ordonez. EnerQuery: energy-aware query process-
    ing. In Proceedings of the 25th ACM International Conference on Information
    and Knowledge Management, pages 2465–2468. ACM, 2016.
22. P. G. Selinger, M. M. Astrahan, et al. Access path selection in a relational database
    management system. In ACM SIGMOD, pages 23–34. ACM, 1979.
23. R. Varadarajan et al. DBDesigner: A customizable physical design tool for
    Vertica analytic database. In 2014 IEEE 30th International Conference on Data
    Engineering, pages 1084–1095. IEEE, 2014.
24. Z. Xu, Y. Tu, and X. Wang. PET: reducing database energy cost via query opti-
    mization. PVLDB, 5(12):1954–1957, 2012.