Value-driven Approach for Designing Extended Data Warehouses

Nabila Berkani, Ecole nationale Supérieure d'Informatique, BP 68M, 16309, Oued-Smar, Alger, Algérie. n-berkani@esi.dz
Ladjel Bellatreche, ISAE-ENSMA, France. ladjel.bellatreche@ensma.fr
Selma Khouri, Ecole nationale Supérieure d'Informatique, BP 68M, 16309, Oued-Smar, Alger, Algérie. s-khouri@esi.dz
Carlos Ordonez, University of Houston, USA. carlos@central.uh.edu

© 2019 Copyright held by the author(s).

ABSTRACT
In a very short time, the data warehouse (DW) technology has gone through all the phases of a technological product's life: introduction on the market, growth, maturity and decline. Maturity means there is a clearly identified design life cycle, plus a race and competition between companies to increase their decision-making power. Decline was signaled by the appearance of Big Data. It is therefore essential to find other challenges that will contribute to the revival of the DW while taking advantage of the V's of Big Data. The arrival of the Linked Open Data (LOD) era is an excellent opportunity for both the DW academia and industry communities. LOD may bring an additional Value that the sources feeding a DW typically do not succeed to yield. Offering the added value of a DW is related to a high Variety of sources. In this paper, first, we conceptualize the variety of internal and external sources and study its impact on the ETL phase to ease the capture of value. Secondly, three scenarios for integrating LOD into the DW are given. Finally, experiments are conducted to show the effectiveness of our approach.

1 INTRODUCTION
In contrast to traditional database applications, the process of building a DW is a complex, expensive, and time-consuming task. Knowing this risk, companies willing to conduct a DW project should never start unless managers are convinced that its benefits outweigh the cost, known as Return On Investment (ROI). Survey studies conducted by analytical companies such as International Data Corporation and SAS clearly conclude that DW technology provides a good payback, in the sense that the average ROI for a DW is far above the industry average, confirming the added value of DW technology.

With the arrival of Big Data, companies owning a DW had to change their BI strategy and align it. This alignment comes from facing the V's brought by Big Data (Volume, Variety, Velocity, Veracity). This situation pushes these companies to enhance their DW environment with Big Data technology, including distributed programming, cloud computing, parallel processing and so on. These technologies mainly focus on managing Volume and Velocity of data, leaving Variety as a second priority. More recently, considering value-requirements fixed by a company (money invested, awed customers, increased sales, etc.) has spawned another V: Value.

The DW value has to be evaluated considering risks [17]. One of the most important risks is the lack of satisfaction of user requirements [9]. This is due to the fact that the data sources participating in the DW construction are not rich enough in terms of concepts and instances. This limitation can decrease the value of the target DW and consequently its ROI, where decisions will perhaps be made without data. This scenario is deeply related to the trade-off between the Closed World Assumption (CWA) and the Open World Assumption (OWA). The CWA states that what is not known to be true must be false, whereas the OWA is the opposite. Therefore, building a DW only from database sources may penalize its value. To deal with the Value risk, recent studies propose augmenting traditional DW sources with external Web sources such as LOD and knowledge graphs [12]. Usually, LOD store data with high quality, since important efforts in curation, cleaning, entity resolution, etc. are deployed. If a company succeeds in selecting LOD, their integration into the DW will not affect its quality, because the traditional ETL processes will be augmented to deal with this new source of data. The price to pay by designers when considering LOD is managing data Variety: LOD bring a new format of data, usually incompatible with traditional ones. To augment a DW with LOD, important efforts in conceptualizing and managing data Variety have to be made. This variety concerns three main aspects: (a) the universe of discourse (UOD) of the sources, (b) their conceptual formalisms, and (c) their physical implementations.

In the context of LOD, their schema has the potential for being the chosen model that federates external sources, as it is open, standardized, visualizable, and associated with graph database tools, while maintaining interoperability with semantic databases and ontologies. Moreover, it allows conceptualizing variety at two levels: vocabulary (using ontologies) and formalisms (RDF Schema). In this paper, we defend this LOD-augmented scenario and we show that it significantly impacts the ETL environment, including its tasks, workflows and operators. With this motivation in mind, we propose three comprehensive scenarios for a company to integrate LOD into the DW design, while meeting the Variety and added Value requirements at the conceptual level. These scenarios are distinguished based on the ordering of the events stipulating the time when the company decides to build its DW and the time when it decides to connect its DW to a relevant LOD. More precisely, these scenarios follow two main schedules: (a) the DW meets LOD and (b) the DW was designed before LOD. In the first schedule, the DW is built from scratch by a simultaneous integration of internal and external data sources, whereas in (b), we assume that the DW was constructed well before the company decided to integrate LOD.
Therefore, such a DW needs to continuously integrate data from local sources and LOD. Thanks to conceptual modeling, variety is managed at the ETL level; Value is handled by introducing metrics related to requirement satisfaction.

This paper is structured as follows: Section 2 positions LOD in the DW landscape. Section 3 is related to the conceptualization of variety and its impact on ETL. Section 4 details our scenarios for integrating LOD into the DW design to add value. An experimental study is conducted in Section 5. Section 6 concludes our paper and outlines future work.

Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

2 RELATED WORK
A couple of recent studies consider LOD in the process of DW construction, without emphasizing value. These studies can be projected on the conventional DW life-cycle design that includes: requirements definition, conceptual design, ETL, and logical and physical design. Even though experience has shown that bringing requirements to the forefront ensures that the DW is tightly tailored to the users' requirements [8], in most studies considering LOD in DW design, users' requirements are either ignored or assumed defined. Our work is motivated by the importance of requirements on the DW system incorporating LOD, in identifying the missing concepts and instances required for DW value augmentation.

LOD works have come up with new approaches for managing the variety of sources, covering only some parts of the life-cycle phases, namely: (i) Conceptual level: the unification of the universe of discourse is either ignored, or handled relying on ad-hoc structures such as correspondence tables (using similarity measures) [6, 15] or a shared ontology [1]. (ii) Logical level: most studies highlight multidimensional models and map LOD sources to this logical format; these models are either generic multidimensional models or ad-hoc models, while other studies privilege the LOD format, based on the graph representation, for the target DW [5–7]. (iii) ETL level: variety management was handled using conventional ETL processes that integrate and load external sources into the DW [5, 6]. Other studies propose an incremental fetching and storing of external sources on demand, i.e., as they are needed during the analysis process [2, 5]. In these studies, a single scenario for integrating LOD is assumed, thus variety is treated partially, in the sense that it has already been dealt with in the initial DW. (iv) Physical level: the proposed scenario obliges designers to manage the variety of LOD according to the physical implementation of the DW, at the unification formalism level [2, 6, 10] or at the querying level [11, 13, 16].

Contrary to existing studies, our approach proposes three main contributions: (i) a conceptualization of variety in the presence of internal and external LOD sources; (ii) different scenarios inspired from the organizational level of a company that decides to incorporate LOD within its internal sources; (iii) an analysis of the impact of these variety scenarios on the Value.

3 VARIETY AND VALUE MANAGEMENT
3.1 First V: Variety
In this section, we provide a conceptualization of the ETL environment covering the three proposed scenarios. The ETL process includes: (a) operators, (b) activities and (c) workflows. By means of the WfMC (http://www.wfmc.org/), we propose a metamodel to handle the variety of internal and external sources (Figure 1). An ETL workflow is the global collection of ETL activities and the transitions between them. A transition determines the execution sequence of activities to generate a data flow from the sources to the target DW. The ETL activities are defined at the conceptual level in order to manage the UOD. ETL operators manage instances that are stored in the sources according to a defined format (relational, semantic, etc.). The graph model of LOD is used as a pivot model, and internal sources are mapped to this format: a mapping effort from the internal sources to the pivot model is needed. ETL operators are redefined using the graph format of LOD, which we consider as the elected pivot model. The ETL process allows extracting the instances, transforming them and loading them (ETLOperator class). We used the ten generic operators proposed in [18], which we classify into three groups: Source operators, Transform operators and Store operators. In the next section, we extend these operators to deal with our scenarios.

Figure 1: ETL workflow MetaModel.

The set of operators of each group is defined as enumerations in the model. An example of the redefinition of operators on the LOD graph format is given:
- Extract(G, Nj, CS): extracts, from G, the node Nj satisfying the constraint CS;
- Context(G, Gc, Ctx): extracts from G a sub-graph Gc that satisfies the context defined in Ctx.

3.2 Second V: Value
The construction of a value-based DW is formalized as follows. Given: (i) a set of internal sources SI = {Si1, Si2, ..., Sim}; (ii) a set of external LOD sources SE = {Se1, Se2, ..., Sem}, where each internal and external source has its own format FormatSj and its conceptual model CMj describing its universe of discourse; (iii) a set of requirements G to be satisfied; (iv) a DW (to be defined or operational) with its conceptual model CMDW describing its UOD and one or more formats FormatDW = {f1, f2, ..., fk}; (v) the value-requirement fixed by the company. The added value of the target warehouse (Value) regarding a given value-requirement can be calculated as follows:

    Value = Σ_{Si ∈ SI ∪ SE} Weight(Si) × Value(Si)    (1)

where Weight(Si) describes the weight of each source; it can be estimated for a given organizational sector.

In our work, the value-requirement concerns DW requirement satisfaction, which is strongly related to the multidimensional concepts and instances provided by the sources. Therefore, we propose three value metrics associated with each source Si: ValueReq, ValueMD and ValueInst. They measure the percentage of value added by external sources in terms of MD concepts and requirements to be met (the "Rate needs" column of Table 1):

    ValueReq(Si) = NumberResponsesReq(Si) / NumberResponsesReq(DW)    (2)

where NumberResponsesReq(Si) indicates the number of results of the queries expressing the initial requirements on the source Si, and NumberResponsesReq(DW) represents the number of results of the queries expressing the requirements on the target DW.

    ValueMD(Si) = NumberConcepts(Si) / TotalNumberConcepts(DW)    (3)

where NumberConcepts(Si) is the number of multidimensional concepts of the DW schema obtained by integrating the source Si, and TotalNumberConcepts(DW) describes the total number of multidimensional concepts of the DW.

    ValueInst(Si) = NumberInstancesInt(Si) / TotalIns(DW)    (4)

where NumberInstancesInt(Si) and TotalIns(DW) represent, respectively, the number of instances of the DW obtained by integrating the source Si and the total number of instances of the DW.

These metrics could be enriched by considering both completeness and consistency.
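As a minimal sketch of Equations (1)–(4), the metrics can be computed as below (in Python; the function names and the sample counts are our illustrative assumptions, not taken from the paper):

```python
# Sketch of the value metrics of Eqs. (1)-(4). All names and the sample
# numbers are illustrative assumptions.

def value_req(n_responses_src, n_responses_dw):
    """Eq. (2): share of requirement-query results contributed by a source."""
    return n_responses_src / n_responses_dw

def value_md(n_concepts_src, total_concepts_dw):
    """Eq. (3): share of multidimensional concepts contributed by a source."""
    return n_concepts_src / total_concepts_dw

def value_inst(n_instances_src, total_instances_dw):
    """Eq. (4): share of instances contributed by a source."""
    return n_instances_src / total_instances_dw

def warehouse_value(sources):
    """Eq. (1): weighted sum of per-source values over SI ∪ SE.

    `sources` maps a source name to a (weight, value) pair."""
    return sum(w * v for w, v in sources.values())

# Hypothetical example: one internal source and one LOD source.
internal = value_req(n_responses_src=60, n_responses_dw=100)   # 0.6
external = value_req(n_responses_src=36, n_responses_dw=100)   # 0.36
print(warehouse_value({"internal": (0.5, internal),
                       "lod": (0.5, external)}))               # 0.48
```

With equal weights, the warehouse value is simply the mean of the per-source metric values; a sector-specific weighting would shift it toward the dominant source.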
Several metrics to measure the quality of referential integrity exist [14]. We could revisit them to measure Value, combined with other V's.

4 OUR SCENARIOS
Three main scenarios reflecting the policies of a company for incorporating LOD are proposed.

4.1 Serial Design
This scenario is feasible when a company decides to build its DW by considering both internal and external sources from scratch. It follows a conventional DW design: LOD is considered as a new semantic source to manage in addition to the internal sources, and an ETL process is defined considering all the sources. This scenario is not realistic, because it requires a redefinition of the whole ETL process each time new requirements arise for extracting value from LOD.

4.2 Parallel Design
This scenario assumes that the target DW is operational and keeps integrating data from internal sources and LOD. The ETL process from LOD is generated and then synchronized with the initial ETL process from internal sources before the loading step. This proposal describes the reaction of the parts of the ETL process affected by a flow change. It requires the consolidation of two parallel ETL flows (internal and external), which is needed to keep the target warehouse up to date. We formalize the consolidation problem by introducing the main design operation, synchronize. It corresponds to a synchronization between (i) the current flow, i.e., the existing ETL flows satisfying the n current information requirements at time t, and (ii) the new flow, i.e., the ETL flow satisfying the upcoming requirement at time (t+1). Synchronize corresponds to a workflow pattern and to a series of operations commonly encountered in workflow management: (i) AND-Join: identify both ETL flows, handle potential deadlocks and perform the join operation between concepts; (ii) OR-Join: a merge operation on concepts and properties done using the Merge operator; and (iii) Clean: perform data cleaning, check null values and delete duplicate data before loading into the target DW. The Synchronize operator is defined using the graph format as follows:
- Synchronize(G, Gi, Gj, CS): synchronizes two sub-graphs Gi and Gj based on some criteria CS (AND-JOIN/OR-JOIN).
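A minimal sketch of Synchronize, assuming a graph encoded as a set of (subject, predicate, object) triples — the set-based encoding and the join criteria are our assumptions, since the paper defines the operator only at the conceptual level:

```python
# Toy sketch of the Synchronize operator of Section 4.2 on graphs stored
# as sets of (subject, predicate, object) triples.

def clean(g):
    """Clean step: drop triples with null values; sets already deduplicate."""
    return {t for t in g if all(v is not None for v in t)}

def synchronize(g_current, g_new, cs="OR-JOIN"):
    """Synchronize(G, Gi, Gj, CS): consolidate the current ETL flow with
    the new one. AND-JOIN keeps the common triples, OR-JOIN merges both."""
    joined = g_current & g_new if cs == "AND-JOIN" else g_current | g_new
    return clean(joined)

# Current flow (internal sources) vs. new flow (LOD); hypothetical data.
internal_flow = {("Film", "hasDimension", "Year"),
                 ("Film", "hasMeasure", "BoxOffice")}
lod_flow = {("Film", "hasDimension", "Year"),
            ("Film", "hasDimension", "Genre"),
            ("Film", "hasMeasure", None)}        # null value, removed by Clean

merged = synchronize(internal_flow, lod_flow, cs="OR-JOIN")
common = synchronize(internal_flow, lod_flow, cs="AND-JOIN")
print(len(merged), len(common))  # 3 1
```

The OR-Join enriches the warehouse schema with the LOD dimension while Clean discards the incomplete triple; the AND-Join keeps only what both flows agree on.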
4.3 Query-driven Design
This scenario corresponds to an on-demand ETL feeding the target DW. Here, data are incrementally fetched from the existing DW and LOD (when necessary), then loaded into the DW only when they are needed to answer some cube queries. This scenario requires rewriting the cube queries on LOD, extracting the required fragments of LOD (using the Context operator), applying the required transformations (using the TransformOperator class) by means of an ETL process dedicated to LOD, and then materializing (using the Store operator) the resulting graphs in case they are required later. The results are first integrated into a data cube reserved for LOD data analysis, and the final results of LOD queries are merged with the results of the cube queries executed on the internal DW in order to display the query result to the end user.

This requires the extension of the ETL workflow metamodel by a Query operator class (illustrated by red dots in Figure 1). The process of this scenario is conceptually illustrated in the ETL metamodel by linking the Query operator class to the classes Source Operator (for extracting LOD by queries), Store Operator (for materializing LOD) and Transform Operator (for handling the transformations required by the ETL process dedicated to LOD, e.g., aggregation and join operations). We also added the methods Rewrite_Query and MergeResult_Queries (which unifies the obtained results) to the Query operator class to manage the different querying operations mentioned. The Merge operator is defined using the graph format as follows:
- Merge(G, Gi, Gj): merges two sub-graphs Gi and Gj into one graph G.
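The query-driven flow can be sketched as follows, again on triples-as-sets; the data, the helper names and the query shape are hypothetical, not from the paper:

```python
# Sketch of the query-driven scenario of Section 4.3: a cube query is
# answered from the internal DW; the needed LOD fragment is fetched on
# demand (Context), materialized (Store), and its result merged with the
# internal result (MergeResult_Queries). All data are illustrative.

lod = {("Amelie", "year", 2001), ("Lagaan", "year", 2001),
       ("Amelie", "type", "Film"), ("Lagaan", "type", "Film")}
dw_cube = {("Titanic", "year", 1997)}          # internal DW content
materialized = set()                            # LOD cube, filled lazily

def context(g, ctx):
    """Context operator: extract the sub-graph satisfying `ctx`."""
    return {t for t in g if ctx(t)}

def answer(query_pred):
    """Rewrite the cube query on LOD, store the fragment for later reuse,
    and merge the LOD results with the internal cube results."""
    internal = {t for t in dw_cube if t[1] == query_pred}
    fragment = context(lod, lambda t: t[1] == query_pred)  # on-demand fetch
    materialized.update(fragment)               # Store operator
    return internal | fragment                  # MergeResult_Queries

films_by_year = answer("year")
print(len(films_by_year), len(materialized))  # 3 2
```

Only the fragment needed by the query is materialized, which is why this scenario reaches requirement satisfaction faster than the serial and parallel designs in Section 5.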
5 EXPERIMENTAL STUDY
In this section, we carry out a set of experiments to show the effectiveness of our proposal in terms of managing variety and augmenting the value of the final DW.

Scenarios            | Dim/Meas | Rate needs | Input size  | Response time (s)
Internal sources     | 6/1      | 60%        | 550 × 10³   | 1.1
Serial design        | 10/7     | 80%        | 7.9 × 10⁶   | 3.2
Parallel design      | 11/8     | 84%        | 3.1 × 10⁶   | 2.6
Query-driven design  | 12/8     | 96%        | 2.9 × 10⁶   | 1.7

Table 1: Value added by considering LOD.

Experimental Setup. Let us consider the Film Academy Awards organizations of four countries as internal data sources to be integrated: the French César awards (12, 123,004 triples), the Deutscher Filmpreis (8, 96,962 triples), the Indian IIFA Awards (15, 3.9 × 10⁵ triples) and the USA Oscars (19, 2.5 × 10⁶ record sets). Let us assume that these organizations collaborate to globally analyze the cinematography industry. We considered a set of 15 analytical requirements (e.g., the popularity of an actor/actress by year). The first three data sources are implemented on the Oracle semantic DBMS using the N-Quads format, while the fourth one is implemented on a traditional (relational) Oracle DBMS.
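Since the fourth source is relational while the pivot model is a graph (Section 3.1), each row must first be mapped to triples. A minimal sketch of such a mapping, with invented table and column names:

```python
# Sketch of mapping an internal relational source to the graph pivot
# model, as Section 3.1 prescribes before the ETL operators run. Table
# and column names are invented for illustration.

def row_to_triples(table, key_col, row):
    """Turn one relational row into (subject, predicate, object) triples:
    the key column identifies the subject, other columns become predicates."""
    subject = f"{table}/{row[key_col]}"
    return {(subject, col, val) for col, val in row.items() if col != key_col}

oscar_row = {"id": 42, "film": "Titanic", "year": 1997, "category": "Best Picture"}
triples = row_to_triples("OscarAward", "id", oscar_row)
print(sorted(triples)[0])  # ('OscarAward/42', 'category', 'Best Picture')
```

After this mapping, the relational source can be processed by the same graph operators (Extract, Context, Merge, Synchronize) as the semantic sources.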
Our external resource corresponds to a fragment of DBpedia extracted using the Context operator applied to Movies. The obtained fragment contains around 7.9 × 10⁶ graph quads. Our evaluations of the ETL processes according to the different scenarios were performed on a laptop (HP EliteBook 840 G3) with an Intel Core i7-6500U CPU at 2.59 GHz, 8 GB of RAM and a 1 TB hard disk, running Windows 10 64-bit.

Variety Evaluation. The purpose of this evaluation is to study the impact of the efforts in conceptualizing variety on the obtained warehouse. In the first experiment, we compare the impact of our elected graph model against the pivot metamodel proposed in [3] (called the graph property pivot model) in terms of concepts, attributes, relationships and instances. To conduct this experiment, we considered our three scenarios. Figures 2a and 2b summarize the obtained results by averaging the number of elements per scenario. They clearly show that our elected graph model captures more elements than the pivot metamodel. This is because all elements satisfying the requirements of the LOD fragment are 100% materialized in the warehouse.

Figure 2: Comparison between the LOD graph and graph property pivot models ((a) conceptual elements integrated; (b) instances integrated).

Added Value Evaluation. The second experiment was conducted to measure the value captured by the obtained DW. We use two criteria: the rate of multidimensional concepts and integrated data, and the rate of satisfied requirements during the integration of LOD. Figure 3a illustrates the obtained results. The consideration of the LOD fragment increases the number of multidimensional concepts for the three scenarios. By comparing the three scenarios, we figure out that they are almost equivalent.

To evaluate the second criterion, we formulated our user requirements as cube queries executed on the target DW. The execution of the cube queries was carried out at four time stages (t0, t1, t2 and t3) during the integration process, where t0 corresponds to the time of considering LOD in addition to the internal sources. Figure 3b describes the obtained results, which demonstrate that before taking LOD into account in the integration process, the user requirements satisfied by the internal data sources represent ~65%. Once the LOD integration process has begun, this rate increases considerably, up to a maximum of 96%. We also noticed that the third scenario (query-driven design) gives the best results and meets the user needs faster than the other scenarios. This can be explained by the fact that this scenario focuses on integrating data that correspond to specific queries reflecting user needs.

Figure 3: ((a) number of multidimensional concepts vs. integrated LOD; (b) rate of satisfied requirements during the integration of LOD.)

Table 1 extends the above results and demonstrates the value added by considering LOD in the design of the DW. A comparison is given between our proposal, considering the three scenarios, and previous work [4], on the basis of some criteria identified during the experimentation. These results clearly indicate that the consideration of LOD data offers added value in terms of the final number of dimensions (Dim) and measures (Meas), the size of the target DW and the rate of satisfied requirements, compared with classical approaches.

The evaluation of the scenario presented in the example was made by users, who evaluated 80% of the concepts and instances in total. They judged that 97% of the concepts and 91% of the instances in the sample are correct and that 96% of the requirements are satisfied. Additionally, adding LOD only takes 1% of additional time (1–3 seconds), which we may consider fast (cf. Table 1).

6 CONCLUSION
In this paper, we showed that DW technology is still alive and that the advances brought by Big Data and the explosion of LOD contribute to its renaissance. We considered two V's, Value and Variety. LOD are viewed as external sources that contribute to increasing the variety of the whole set of sources. We formalized the ETL environment and proposed a pivot ETL workflow metamodel; based on these modeling and meta-modeling efforts, variety is nicely managed. Regarding the value of the target DW, we defined metrics associated with initial requirement satisfaction. Our scenarios for integrating LOD into the DW design have been proposed. They are: (i) LOD and internal sources are physically materialized in the DW and (ii) both LOD and DW query results are merged. Our validation showed that adding LOD increases the value of the target warehouse, represented by the initial user satisfaction. We also realized that the incorporation of LOD is inexpensive in terms of development and fast thanks to our variety management.

We are currently working on the development of a CASE tool that deals with internal and external sources. Another issue concerns the risk of integrating LOD in a DW and detecting and repairing inconsistency and incompleteness in external data sources by considering another V: Veracity.
REFERENCES
[1] A. Abelló Gamazo, E. Gallinucci, M. Golfarelli, S. Rizzi Bach, and O. Romero Moral. Towards exploratory OLAP on linked data. In SEBD, pages 86–93, 2016.
[2] L. Baldacci, M. Golfarelli, S. Graziani, and S. Rizzi. QETL: An approach to on-demand ETL from non-owned data sources. DKE, 112:17–37, 2017.
[3] N. Berkani and L. Bellatreche. A variety-sensitive ETL processes. In DEXA (2), pages 201–216, 2017.
[4] N. Berkani, L. Bellatreche, and S. Khouri. Towards a conceptualization of ETL and physical storage of semantic data warehouses as a service. Cluster Computing, 16(4):915–931, 2013.
[5] A. Berro, I. Megdiche, and O. Teste. Graph-based ETL processes for warehousing statistical open data. In ICEIS, pages 271–278, 2015.
[6] R. P. Deb Nath, K. Hose, and T. B. Pedersen. Towards a programmable semantic extract-transform-load framework for semantic data warehouses. In DOLAP, pages 15–24, 2015.
[7] L. Etcheverry, A. Vaisman, and E. Zimányi. Modeling and querying data warehouses on the semantic web using QB4OLAP. In DaWaK, pages 45–56, 2014.
[8] P. Giorgini, S. Rizzi, and M. Garzetti. Goal-oriented requirement analysis for data warehouse design. In ACM DOLAP, pages 47–56, 2005.
[9] M. Golfarelli and S. Rizzi. A survey on temporal data warehousing. IJDWM, 5(1):1–17, 2009.
[10] B. Kämpgen and A. Harth. Transforming statistical linked data for use in OLAP systems. In I-SEMANTICS, pages 33–40, 2011.
[11] B. Kämpgen, S. O'Riain, and A. Harth. Interacting with statistical linked data via OLAP operations. In ESWC, pages 87–101, 2012.
[12] N. Konstantinou et al. The VADA architecture for cost-effective data wrangling. In SIGMOD, pages 1599–1602, 2017.
[13] A. Matei, K. Chao, and N. Godwin. OLAP for multidimensional semantic web databases. In BIRTE, pages 81–96, 2014.
[14] C. Ordonez and J. García-García. Referential integrity quality metrics. Decision Support Systems, 44(2):495–508, 2008.
[15] F. Ravat, J. Song, and O. Teste. Designing multidimensional cubes from warehoused data and linked open data. In RCIS, pages 1–12, 2016.
[16] R. Saad, O. Teste, and C. Trojahn. OLAP manipulations on RDF data following a constellation model. In Workshop on Semantic Statistics, 2013.
[17] T. P. Sales et al. The common ontology of value and risk. In ER, pages 121–135, 2018.
[18] D. Skoutas and A. Simitsis. Ontology-based conceptual design of ETL processes for both structured and semi-structured data. Semantic Web, 3(4):1–24, 2007.