 Discussion paper: Filling the gap between business rules
  and technical requirements in Business Analytics: The
             Fact-Centered ETL approach

         Antonella Longo1, Mario Bochicchio1, Marco Zappatore1, Lucia Vaira1
                 1 Set-Lab, Dept. of Engineering for Innovation, Univ. of Salento,

                      via per Monteroni, 73100 Lecce, Italy
    {antonella.longo, mario.bochicchio, marcosalvatore.zappatore,
                     lucia.vaira}@unisalento.it

Abstract. In real-time Business Analytics scenarios, like Business Activity Monitoring (BAM)
or Operational Intelligence, using a modeling language to describe ETL processes is fundamental
to provide business users with up-to-date data, in order to support the decision-making process
and to optimize business operations. To bridge the gap between business and technical requirements
and to encapsulate business requirements into technical specifications, it can be very convenient
to design ETL processes starting from business facts; this would effectively support business
users in decision-making processes and would represent business rules and entities as early as
possible in the BI process design. Moreover, a fact-centered approach to ETL process design can
split the traditional black-box approach into several fact-centered flows, which can be processed
in parallel, exploiting the advantages of the distributed multi-threaded process models typical
of big data scenarios. In this context we propose a control-flow-based approach to ETL
process modeling, which starts from the identification of business facts and represents ETL
processes using the BPMN notation, which is the foundation for machine-readable code. Our main
contribution consists in a proposal for structuring ETL processes and related objects and in its
application to a business case. The paper has already been published in Procedia Technology,
Volume 16, 2014, Pages 471-480; this version contains only minor updates.

        Keywords: Business Intelligence; ETL; Process & Conceptual Models; Operational Intelligence.


1       Introduction

Business Intelligence (BI) can be defined as the process of getting information about
the business from available data sources [6]. It assists managers in making correct
decisions based on facts delivered at the right time and place throughout the life of the business.
   The exponential growth of data streams is a big challenge for BI and Business
Analytics (BA), as decisions based on fresh information represent a competitive advantage.
This challenge is addressed by Real-time BI and Operational Intelligence (OI), which provide
visibility into Business Processes (BPs), streaming events, and operations as they are happening.
These topics fall under the Big Data challenges, where the new frontier of data management must
combine big data volumes with event-centric, high-speed processing that allows delivering and
accessing information with low latency. Conceptually similar to OI, Business Activity Monitoring
(BAM) [7] is an enterprise
solution that allows monitoring business activity across operational systems and BPs.
It refers to the aggregation, analysis, and presentation of real-time data about customers'
and partners' activities inside and across organizations. BAM is the real-time and
event-driven extension of BI: while BI products usually handle historical data stored
in a Data Warehouse (DWH), BAM technologies provide managers with real-time
business analyses obtained from operational data sources, more and more integrated
with cloud-based resources for achieving faster Return on Investment and lowering
costs [3]. The success of OI, BAM and BI systems depends mostly on the adequacy of
the populating system: ETL (Extract-Transform-Load) processes are, therefore,
the critical feeding component of DWH/BI systems, as they retrieve data from operational
systems and pre-process them for further analysis [12]. An ETL system translates
specific decision-making processes into system rules, but there are several challenges:
volatile business rules, heterogeneous operational data schemas, proprietary
ETL tools, different notations, and complex ETL porting processes. As of today, there is
no widely adopted methodology covering the ETL development cycle that offers a notation
easily understood by the different user profiles, allows mapping the model onto the execution
environment, and offers pre-implementation validation features. We believe that, in order to
bridge the gap between business and technical requirements and encapsulate business
requirements into technical specifications, ETL processes should be designed starting
from business facts (BFs), thus supporting business users in decision-making
processes and representing business rules and entities as early as possible in the BI
process design. Such a fact-centered approach to ETL process design can split the
traditional black-box approach into several fact-centered flows, which can be processed
in parallel, exploiting the advantages of the distributed multi-threaded process
models typical of big data scenarios. In this paper, the advantages of control flow
models and of the BPMN notation are applied to ETL process design, as defined in [4, 1, 5].
First, design patterns allow structuring the process and selecting the required data.
Second, a uniform notation for BP and DWH modelling eases the communication between
IT and business roles. Third, non-functional requirements (e.g., performance
indicators, data freshness) can be specified more easily. We propose to structure the
ETL design process starting from the identification of BFs and to define a reference
model for representing the ETL process. The integration between the fact-centered
approach and control flow models allows identifying relevant business objects and
isolating the related facts in order to handle them independently.
    The paper is organized as follows. Section 2 analyzes existing works on ETL process
conceptual models; Section 3 discusses our case study; the proposed approach and its
application are described in Section 4; conclusions are presented in Section 5.


2      Related Works

Several efforts have been proposed for the conceptual modeling of ETL processes,
including ad hoc formalisms and approaches based on standard languages like UML [10]
or MDA [11]. Conceptual models that use the BPMN notation for ETL design have also
been proposed. The work we present in this paper is built along the lines of [1]
and [4, 5]. In the former, El Akkaoui et al. present a conceptual model based on the
BPMN standard and provide a BPMN representation for frequently used ETL design
constructs. They propose a meta-model for the conceptual modeling of ETL processes
based on the separation between control and data processes, and on a classification
of ETL objects resulting from a study of the most widely used commercial and open
source ETL tools. The issue of an ETL conceptual view is also addressed in [4, 5]:
the authors propose the use of BP models for a conceptual view of ETL, and show how to
translate this conceptual view into a logical and a physical ETL view that can be
optimized. Our work differs from those described above because we propose a fact-centered
approach to ETL design, which allows obtaining up-to-date analytical data in a timely
and efficient manner, and a control-flow-based modeling of ETL processes, represented
with the BPMN notation.


3      Preliminaries and running example

A BI system is based on the analysis of data sources and on the design and implementation
of the DWH that stores analytical data and of the related ETL procedures, according to its
typical 4-level architecture (Fig. 1) [8]. The integration of heterogeneous sources is
carried out by ETL procedures that extract, integrate, clean, validate and filter data and
then load them into a staging area, organized according to a reconciled model, which
creates a central data reference model for the whole enterprise and introduces a clear
separation between extraction and integration problems and the issues related to the DWH
loading process. The DWH can be directly consulted or used as a source to build data
marts (i.e., data subsets/aggregations of the DWH). The concepts of interest in the
decision-making process are named facts, which typically correspond to events that
occur within the enterprise. Every fact represents a set of events, quantitatively
described by measures and analysis dimensions, the latter detailed with hierarchies of
attributes. Facts are modeled as Dimensional Fact Models (DFMs) [8], so that the
efficient consultation of the integrated data is available for reports, analyses and simulations.
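   For illustration purposes only, the sketch below encodes a single event of a hypothetical delivery fact, with its measures and a simple dimension hierarchy; the attribute names are illustrative assumptions and are not taken from [8] or from our platform.

```python
from dataclasses import dataclass
from datetime import date

# Minimal, hypothetical sketch of one event of a Dimensional Fact Model:
# a fact is a set of events quantitatively described by measures and
# analysis dimensions, with hierarchies of attributes on the dimensions.

@dataclass(frozen=True)
class DateDimension:
    """Date dimension with a day -> month -> year attribute hierarchy."""
    day: date

    @property
    def month(self) -> int:
        return self.day.month

    @property
    def year(self) -> int:
        return self.day.year

@dataclass
class DeliveryFact:
    """One event of a 'delivery' fact: dimensions plus measures."""
    card_id: str                  # dimension
    delivery_type: str            # dimension
    delivery_date: DateDimension  # dimension with hierarchy
    payment: float                # measure
    days_to_deliver: int          # measure

# Example event
event = DeliveryFact("C-001", "home delivery", DateDimension(date(2014, 3, 12)), 2.5, 4)
```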




                     Fig. 1. Data warehousing system architecture [8]

   ETL procedures require constraint checks to filter data according to quality rules,
covering both business rules and format inconsistencies. Some of these rules (e.g.,
exception handling, data workflow patterns) can be derived from control-flow patterns.
Erroneous data feed a specific database that logs error types, source tables and
execution dates. In our approach we propose a method to structure ETL processes
based on the relevant BFs, and to model them according to control-flow models.
   Let us now discuss our running example: an Italian company needs a Service Level
Agreement (SLA) management platform to calculate and present the service levels
associated with fidelity card printing and delivery services. Operational data from
several relational data sources must be mapped into hyper-cubes via an ETL process in
order to produce reports and booklets for controlling the quality of the services
according to the SLAs. Starting from the operational data sources, the system runs ETL
procedures to populate the reconciled model and then the hyper-cubes. The data stored
in the operational database concern fidelity card details and related events, which
correspond to the facts defined in the DWH. In this running case we describe the
application of the proposed approach to fidelity card printing, delivery and complaint events.


4      The fact-based ETL process

BFs are business artifacts that are meaningful to the business. They can be modeled as
business domain objects for transactional purposes or as DFMs for analytical processing.
Once a BF has been defined in the domain, it must be populated starting from heterogeneous
data sources. As in [1], we consider the ETL process as a combination of two perspectives.
First, a control flow process, which manages the branching and synchronization of the
flows, handles execution errors and exceptions and coordinates the information exchange
with external processes. Second, a data process, which sends operational data to the DWH,
thus providing insights about the inputs and outputs of each process element. Therefore,
in Fig. 2 we propose our fact population data process, which exploits the conceptual
separation between quality management, data and fact population processes.




                               Fig. 2. ETL Process Model

   The Quality Management can be further specialized, as it includes data quality,
ETL process and BP management; it can be considered as an event generator that
checks compliance with given constraints on business rules, data formats, integrity
and ETL rules. The Quality Management handles quality checks within the whole
ETL process, at different levels, via transformations exploiting syntactic and semantic
rules, or business and ETL constraints. In particular, the Data Quality Management
phase considers the syntactic accuracy of the collected data and applies syntactic
constraints to extract and load data into the reconciled model. These rules allow
discarding incorrect data coming from errors that may affect the population process.
   In addition to the rules on syntactic data accuracy, ETL processes need the definition
and application of business rules that assess data coherence and consistency within
the specific reference domain. These rules are defined in the Business Quality Management
phase, which includes the procedures for applying business rules to data in order to
filter out records that violate the constraints. The ETL Process Quality Management aims
to filter data according to constraints related to the ETL procedures (e.g., it verifies
the correctness of the data produced by the ETL modules to prevent duplicated or contradictory data).
   The output of the Quality Management process is the storage of the filtered data either
into the reconciled model, according to the Fact Management process, or into the log
database, which includes a record for each violated constraint with the following
information: source table name, record identifier, error code, error description, and
check date. This approach allows tracking data source errors, easily identifying the
reasons why records were discarded, and obtaining statistics such as the most frequently
violated constraints.
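   For illustration purposes, a minimal sketch of this routing logic is given below; the constraint codes, field names and rules are hypothetical examples and not part of the platform described in the paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable

# Hypothetical sketch of the routing performed by the Quality Management process:
# a record either satisfies every constraint and reaches the reconciled model,
# or one log entry per violated constraint is written to the error database.

@dataclass
class LogEntry:
    source_table: str
    record_id: str
    error_code: str
    error_description: str
    check_date: date

# (constraint code, description, predicate) - illustrative rules only
CONSTRAINTS: list[tuple[str, str, Callable[[dict], bool]]] = [
    ("SYN-01", "card identifier must not be empty",
     lambda r: bool(r.get("card_id"))),
    ("SYN-02", "delivery date must be present",
     lambda r: r.get("delivery_date") is not None),
    ("BUS-01", "delivery date must not precede the printing date",
     lambda r: r.get("delivery_date") is None or r.get("printing_date") is None
               or r["delivery_date"] >= r["printing_date"]),
]

def route_record(record: dict, source_table: str,
                 reconciled: list[dict], error_log: list[LogEntry]) -> None:
    """Store the record in the reconciled model or log every violated constraint."""
    violations = [(code, desc) for code, desc, check in CONSTRAINTS if not check(record)]
    if violations:
        for code, desc in violations:
            error_log.append(LogEntry(source_table, str(record.get("card_id", "?")),
                                      code, desc, date.today()))
    else:
        reconciled.append(record)
```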
The Fact Management Process handles the data entry into data structures that can be
staging-area tables of the reconciled model or facts inside a DWH. In both cases, the
population process consists of an initial data extraction phase, followed by a
transformation step and, finally, by the loading of the transformed data into the specific
reconciled or fact tables. In particular, the Extract activity performs the initial
collection of the data that will be transformed and loaded into the destination, and takes
care of the integration of multiple data sources, managing issues such as format and
conceptual heterogeneity due to differences in communication protocols, data formats,
schemas, vocabularies, etc. The Transformation process handles the activities related to
the second step of the ETL process, which aims to obtain analytical data from operational
ones and to match the input data with the output data structures. Typical ETL
transformation procedures include data summarization, derivation, aggregation, conversion,
constant substitution, setting of values to null or to defaults based on particular
conditions, and so on. Finally, the Load module administers the population of the target
table, which can be part of a reconciled database, a fact or a dimension, including the
management of primary keys, integrity constraints and indexes.
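   A minimal sketch of this three-step fact population process is given below; the function bodies and field names are illustrative assumptions, not the platform's actual transformations.

```python
from typing import Iterable

# Minimal sketch of the Extract -> Transform -> Load sequence for one fact.
# Field names and transformation rules are hypothetical placeholders.

def extract(source_rows: Iterable[dict]) -> list[dict]:
    """Extract: initial collection of operational data for one fact."""
    return [dict(row) for row in source_rows]

def transform(rows: list[dict]) -> list[dict]:
    """Transform: derive/convert values and match them to the target structure."""
    out = []
    for row in rows:
        out.append({
            "card_code": row["card_id"].strip().upper(),     # conversion
            "delivery_date": row.get("delivery_date"),
            "payment": row.get("payment", 0.0),              # default value
            "late": row.get("days_to_deliver", 0) > 5,       # derivation
        })
    return out

def load(rows: list[dict], target_table: list[dict]) -> None:
    """Load: populate the target (fact or reconciled) table, enforcing the key."""
    existing = {r["card_code"] for r in target_table}
    for row in rows:
        if row["card_code"] not in existing:                 # simple primary-key check
            target_table.append(row)
            existing.add(row["card_code"])

# Usage: delivery_fact = []; load(transform(extract(operational_rows)), delivery_fact)
```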




                   Fig. 3. Conceptual representation of reconciled model

   In our model, we design an ETL process for each BF; therefore the whole ETL process
can be considered as a set of fact processes and quality rule checks, which can be
integrated and managed through control flow processes. We propose to design the ETL
starting from the definition of BFs and to split the ETL process in order to isolate
them within the flow. This approach: 1) allows managing facts independently, within a
framework providing the integration aspects; 2) helps to structure the ETL design;
3) allows business users to work on their BFs even if the whole process is not completed
yet. The result is the reduction of the BF processing cycle and the timely provision of
analytical data. Splitting and speeding up the ETL process is one of the steps for
accelerating the BA process, enabling the faster response times that are fundamental in
real-time BA systems. Furthermore, an ETL process representation based on control flow
models, and in particular on the BPMN notation, adds information about accountability,
as it allows linking process activities to the involved user profiles. The profiles
typically involved in an ETL process include: 1) the Data Provider (who assesses data
readiness); 2) the Data Quality Supervisor (who checks data accuracy and correctness
from a business point of view); 3) the Technician (who handles the technical cycle).
The definition of roles, together with a BPMN-based conceptual modeling of the ETL
process, improves accountability: whenever a problem occurs, managers can immediately
identify the activity that generated the problem and who is accountable for it.
   Now we apply the proposed approach to the running example described in the previous
section. Starting from a portion of the conceptual schema of the reconciled model and
from the DFM corresponding to fidelity card delivery, we illustrate the ETL process and
the transformation that loads the Delivery table of the reconciled model, both modeled
with the BPMN notation.
   Fig. 3 shows the Entity-Relationship (ER) model that represents fidelity card
printing, delivery and complaint events. The printing event is modeled by the
relationship between the Fidelity Card and Lot entities: a printing lot includes several
cards, and each card can be associated with only one lot. A delivery event is linked to
each fidelity card: detailed delivery data, such as the delivery date or the payment, are
included in the Delivery entity, whereas the Delivery Type entity is used to classify
the feasible delivery typologies, according to the adopted procedures. The relationship
between Fidelity Card and Complaint models the complaint events: one or more complaint
events can be registered on the same card, and each of them is linked to a single card.
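   As a complementary sketch of the reconciled-model entities of Fig. 3, the following data classes encode the entities and cardinalities described above; attribute names beyond those mentioned in the text are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Sketch of the Fig. 3 entities as plain data classes (illustrative only).

@dataclass
class Lot:
    lot_id: str
    printing_date: date

@dataclass
class DeliveryType:
    type_id: str
    description: str

@dataclass
class Delivery:
    delivery_date: date
    payment: float
    delivery_type: DeliveryType

@dataclass
class Complaint:
    complaint_id: str
    registration_date: date

@dataclass
class FidelityCard:
    card_id: str
    lot: Lot                              # each card belongs to exactly one printing lot
    delivery: Optional[Delivery] = None   # one delivery event per card
    complaints: list[Complaint] = field(default_factory=list)  # zero or more complaints
```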
   Once the reconciled model is populated, a second step of ETL procedures elaborates the
data and stores them into hyper-cubes. Fig. 4 (left side) shows the DFM for analytical
data on fidelity card delivery events (i.e., one of the outputs of the ETL process).
Delivery events can be analyzed along many dimensions (e.g., state, card, delivery and
registration date, etc.). By applying the proposed approach to the ETL process that loads
the reconciled model tables, we represent the process in BPMN notation; Fig. 4 (right
side) refers to the loading of fidelity card printing, delivery and complaint events.
   The process is represented as a set of complex activities, aggregated into
sub-processes (each of which corresponds to a table of the reconciled model). Activity
execution is modeled through the Parallel Split workflow control pattern, used when a
single thread of control splits into multiple threads that can be executed in parallel [9].
In the BPMN notation, the pattern is represented with a parallel gateway, to highlight the
fact-centered approach that allows the loading of the single tables to be run and handled
independently. The whole process ends when all the activities are concluded, according to
the Synchronization workflow control pattern, used when multiple parallel activities
converge into one single thread of control, thus synchronizing multiple threads [9].
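   For illustration purposes only, the combination of the two patterns can be sketched as follows; the table names follow the running example, while the loading logic is a placeholder rather than the platform's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the Parallel Split / Synchronization patterns used in Fig. 4:
# each sub-process loads one table of the reconciled model, and the whole
# process ends only when all of them have completed (synchronization).

def load_table(table_name: str) -> str:
    # ... extract, transform and load the given reconciled-model table ...
    return f"{table_name} loaded"

TABLES = ["Lot", "FidelityCard", "DeliveryType", "Delivery", "Complaint"]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(load_table, TABLES))   # parallel split
# leaving the 'with' block waits for every thread: the synchronization point
print(results)
```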
   Each activity of the previous ETL process representation is a sub-process that can
be further detailed and modeled with the BPMN notation. As an example, Fig. 5 shows
the BPMN representation of the fact process related to the loading of the Delivery
table. The fact process starts by collecting the operational data related to card
deliveries (Extraction phase); this activity is in charge of the Data Provider. The
process goes on with two quality controls, through which the validity of the card
identifier and the existence of the delivery date are checked; the Data Quality
Supervisor is accountable for these checks, whereas the Technician is responsible for
error handling and for inserting the event into the appropriate table. Quality checks
are modeled with the Exclusive Choice workflow control pattern, used when one of several
branches is chosen based on a decision or on workflow control data [9]. The Technician
is also accountable for the following steps, related to the lookups made in order to
obtain the card and delivery codes and to record sorting; this activity closes the
Transformation phase. The last step of the fact process is the Loading phase, which
inserts the data into the Delivery table of the reconciled model. The activities
included in this process are executed for each record of the source Delivery table.
The proposed approach, applied to the running example, allows organizing the ETL
process, through a conceptual separation between data quality management and fact
loading procedures, and graphically representing the process, thanks to control flow
models, design patterns and the BPMN notation, in a fact-centered perspective oriented
to real-time BA.
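   To make the described steps concrete, a minimal sketch of the Delivery loading fact process is given below; the lookup tables, field names and error messages are hypothetical placeholders, not taken from the actual platform.

```python
from datetime import date

# Minimal sketch of the Delivery loading fact process of Fig. 5.

CARD_LOOKUP = {"C-001": 1, "C-002": 2}            # card identifier -> card code
DELIVERY_TYPE_LOOKUP = {"home": 10, "office": 20} # delivery type -> delivery code

def load_delivery(records: list[dict], delivery_table: list[dict],
                  error_log: list[dict]) -> None:
    staged = []
    for rec in records:                                       # Extraction output
        # Exclusive-choice quality checks (Data Quality Supervisor)
        if rec.get("card_id") not in CARD_LOOKUP:
            error_log.append({"record": rec, "error": "invalid card identifier"})
            continue
        if not isinstance(rec.get("delivery_date"), date):
            error_log.append({"record": rec, "error": "missing delivery date"})
            continue
        # Transformation: lookups for card and delivery codes (Technician)
        staged.append({
            "card_code": CARD_LOOKUP[rec["card_id"]],
            "delivery_type_code": DELIVERY_TYPE_LOOKUP.get(rec.get("type"), 0),
            "delivery_date": rec["delivery_date"],
        })
    staged.sort(key=lambda r: r["delivery_date"])             # record sorting
    delivery_table.extend(staged)                             # Loading phase
```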




   Fig. 4. DFM of cards delivery (on the left) and ETL process representation (on the right)




                               Fig. 5. Delivery Table Loading
5      Conclusions

The paper links conceptual ETL modeling with real-time BA and Big Data concepts.
The changes in business requirements and needs, as well as the growth of the data
volumes to be analyzed, lead to the introduction of new solutions to provide users with
analytical data that efficiently support the decision-making process. In this context,
we presented a fact-centered approach to ETL design that aims at reducing the ETL
process granularity according to the identification of BFs, so that they can be processed
independently. Our proposal also structures ETL processes by distinguishing data
quality management activities from fact processes, and by using control flow models,
workflow patterns and the BPMN notation to support the conceptual representation of
the ETL process. The proposed approach has been applied to an SLA Management
platform managing the service levels related to card printing and delivery services.
   Next steps will include the specification of a tool for graphically designing ETL
processes and transforming them into ETL platform-specific languages. Moreover, we are
designing a wizard to support the definition of BFs, according to the business artifact theory [13].


Acknowledgement

This work is partially funded by Engineering S.p.A. and Essentia s.r.l.


References
 1. El Akkaoui Z, Mazón JN, Vaisman A, Zimányi E. BPMN-Based Conceptual Modeling of
    ETL Processes. LNCS vol. 7448, 2012, pp 1-14.
 2. Mell P, Grance T. The NIST Definition of Cloud Computing. Special Publication 800-145,
    National Institute of Standards and Technology, 2011.
 3. Birst. Why Cloud BI? The 9 Substantial Benefits of SaaS BI. Birst, 2010.
 4. Wilkinson K, Simitsis A, Dayal U, Castellanos M. Leveraging Business Process Models
    for ETL Design. Conceptual Modeling – ER 2010, pp 15-30.
 5. Dayal U, Wilkinson K, Simitsis A, Castellanos M. Business Processes Meet Operational
    Business Intelligence. Bulletin of the IEEE Comp. Soc. TC on Data Engineering, 2009.
 6. Azimuddin K, Karunesh S. Business Intelligence: a new dimension to business. Institute of
    Business Management, Pakistan Business Review, 2011.
 7. Schmidt W. Business Activity Monitoring (BAM). Business Intelligence and Performance
    Management, Advanced Information and Knowledge Processing, 2013, pp 229-242.
 8. Golfarelli M, Rizzi S. Data Warehouse. McGraw-Hill, 2006.
 9. Van Der Aalst W. et al., Workflow Patterns. Distributed and Parallel Databases, 2003.
10. Luján-Mora S, Vassiliadis P, and Trujillo J. Data Mapping Diagrams for Data Warehouse
    Design with UML. In ER, pages 191–204, 2004.
11. Mazón JN, Trujillo J, Serrano MA, Piattini M. Applying MDA to the development of data
    warehouses. In DOLAP, pages 57–66, 2005.
12. Kimball R, Caserta J: The Data Warehouse ETL Toolkit: Practical Techniques for Extract-
    ing, Cleaning, Conforming, and Delivering Data. 2004.
13. Cohn, D., & Hull, R. Business Artifacts. Bulletin of the IEEE Comp. Soc. TC on Data En-
    gineering, vol. 32, pp. 1-7, 2009.