There is no Data Science without Data Governance: a Proposal Based on Knowledge Graphs

Besim Bilalli

Petar Jovanovic

Sergi Nadal

Anna Queralt

Oscar Romero

Universitat Politècnica de Catalunya, UPC-BarcelonaTech

Data Science and data-driven Artificial Intelligence are here to stay, and they are expected to further transform the current global economy. From a technical point of view, there is overall agreement that data-based disciplines require combining data engineering and data analysis skills, but the fact is that data engineering is nowadays lagging behind and trying to catch up with the rapid changes in the data analysis landscape. To unleash the real power of data, data-centric systems must be professionalized, i.e., operationalized and systematized, so that repetitive, time-consuming and error-prone tasks are automated. To this end, we propose our vision of next-generation data governance for data-centric systems based on knowledge graphs. We claim that without the knowledge embedded in the data governance layer, Data Science will not unleash its potential.

Keywords: data lifecycle, data management, data analytics, data governance, data science

DOLAP'24: 26th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data
besim.bilalli@upc.edu (B. Bilalli); petar.jovanovic@upc.edu (P. Jovanovic); sergi.nadal@upc.edu (S. Nadal); anna.queralt@upc.edu (A. Queralt); oscar.romero@upc.edu (O. Romero)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. A Data-Centric System

We are nowadays witnessing the rise of the so-called data-driven economy, where data is an organizational asset from which to extract objective evidence and gain competitiveness. However, the promises related to data and its transforming aspects are yet to be realized.

First, collecting, organizing and managing large data repositories is hard. Concepts such as data lakes, data fabric, data mesh or DataOps, among many others, have arisen to help systematize and operationalize data management. Yet, current solutions require a huge manual burden and there are still no reference architectures (such as Data Warehousing for Business Intelligence, which is however not suitable for the problems framed by Data Science) [1]. Thus, organizations tend to work with different data silos, which are fragmented views of their own data that, in many cases, they are not able to cross. As a result, most data analyses conducted nowadays are based on whatever data happen to be available, which are neither properly contextualized nor contain all the potentially relevant variables in the organization.

The main reason behind all these problems is the lack of governance of the whole data lifecycle. Data governance may be defined as what decisions must be made to ensure effective data management and data usage, and who makes those decisions [2]. We identify four main aspects required to govern the complete data lifecycle [3]. The first is the data principles, which establish the link between the data assets and the business. The second is machine-readable metadata, which describe not only the data assets but also how to access and manipulate the data. Third, metadata describing the complete data lifecycle within the system is mandatory (i.e., datasets used in a specific analysis, transformations and data preparation performed, algorithm chosen, model training information, etc.). Finally, a transversal but equally relevant aspect is data quality, which includes the qualitative description of the data assets. Importantly, as part of the metadata describing the data lifecycle, the transformations conducted to guarantee data quality must be included.

In short, data governance calls for a systematic organization and annotation of data assets. Yet, current works focus either on how to organize data assets (i.e., data management) or on how to annotate them with metadata (data enrichment). There are no end-to-end data governance proposals covering the whole data lifecycle.

Figure 1 presents the ambitious architectural framework we propose to make data governance a reality. Our vision is grounded on four main subsystems: (i) the data management subsystem, which stores and manages the data assets; (ii) the data analysis subsystem, where the analytics take place; (iii) the data governance subsystem, where all the decisions, transformations and actions made at any step of the data lifecycle are annotated in a machine-readable format using knowledge graphs; and (iv) the exploitation subsystem, where a set of modules, which interface with the data governance subsystem, embed usual actions (e.g., creating artifacts in the data management and/or analysis subsystems). As such, this architecture mimics that of a database system and, ideally, user interactions should always be conducted via the exploitation layer to guarantee that whatever action is taken is properly annotated in the data governance subsystem (portraying the data independence principle).

Relevantly, the data management subsystem follows good practices and distributes the data assets (from raw data to other levels of transformation) into zones to separate concerns and facilitate maintenance and evolution. A dataset is registered into the system via the register source module. Registering a dataset automatically triggers several tasks: generating (i) a graph-based representation of its schemata (also known as bootstrapping) and (ii) mappings (via the data discovery module) from it to (iii) a formatted representation of such data according to the chosen canonical data model (e.g., key-value). The integration module consolidates a set of datasets into a single integrated graph, which represents the integrated schema of the system. Relevantly, mappings between the integrated and local graphs allow querying the system via the integrated graph for exploration purposes. The integrated graph is the core metadata artifact through which the users interact with the system. For example, data quality actions are conducted on top of the integrated graph (and propagated to the sources) via the data curation module, whose data assets are stored in the trusted zone. The day-by-day vocabulary, linked to the integrated graph, allows the users to express their needs in terms of their known vocabulary. Accordingly, end-users may express an analytical intent on top of the integrated graph via the intent-based specification module. This module leverages the analytical dataflow generation module, which first materializes an integrated dataset in the exploitation zone and then, from it, generates the required data analysis workflow according to the intents expressed. Finally, all decisions made during the execution of any of the modules mentioned are properly annotated in the traceability graph.

The core of this architecture is the layered knowledge graph created for data governance, which will enable the development of next-generation data-centric systems providing several benefits, especially on the data analysis end, that will smooth current difficulties in data-centric projects. In short, we claim that rigorous data governance: (i) facilitates systematizing and operationalizing data-centric projects, where data-related artifacts are organized to facilitate developing, maintaining and evolving complex operations on top of them; and (ii) enables automation of complex processes. Specifically, we target the full automation of repetitive, time-consuming and error-prone tasks both for data management and analysis. Governance brings many benefits in this respect: (a) the burden of collecting, storing and managing datasets is mostly hidden from the end-user; (b) data analysis can be automated, in simple scenarios, via analytical intents expressed over the integrated graph; and (c) although we acknowledge that some aspects of the data lifecycle cannot be fully automated (data integration, interpretation of analytical results, etc.), these can be supported (e.g., by ranking alternatives). Finally, governance (iii) generates rich metadata that can be analyzed to conduct meta-analysis about how data is used at any level (collected, stored, transformed, analyzed, etc.), or that can be used to enrich and contextualize data analysis (e.g., to avoid LLM hallucinations).
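To make the module flow described in this section more concrete (register source, bootstrapping, integration, exploration via the integrated graph, and traceability), the following is a minimal Python sketch. It is not the implementation behind the proposed architecture: the class GovernanceLayer, the function bootstrap_schema, the ex: predicates and the toy datasets are all hypothetical names introduced here for illustration, and the knowledge graphs are reduced to plain sets of triples, assuming key-value records as the canonical data model.

```python
from dataclasses import dataclass, field

Triple = tuple[str, str, str]  # (subject, predicate, object); hypothetical encoding


def bootstrap_schema(dataset: str, records: list[dict]) -> set[Triple]:
    """Derive a graph-based schema (as triples) from key-value records."""
    triples = {(dataset, "rdf:type", "ex:Dataset")}
    for record in records:
        for key in record:
            triples.add((dataset, "ex:hasAttribute", f"{dataset}.{key}"))
    return triples


@dataclass
class GovernanceLayer:
    """Toy governance layer; dataset names are assumed to contain no dots."""
    data: dict[str, list[dict]] = field(default_factory=dict)
    local_graphs: dict[str, set[Triple]] = field(default_factory=dict)
    integrated_graph: set[Triple] = field(default_factory=set)
    traceability: list[Triple] = field(default_factory=list)

    def _trace(self, module: str, action: str, target: str) -> None:
        # Every module call leaves an annotation in the traceability graph.
        self.traceability.append((module, action, target))

    def register_source(self, dataset: str, records: list[dict]) -> None:
        """Store the records and bootstrap the local graph of the dataset."""
        self.data[dataset] = records
        self.local_graphs[dataset] = bootstrap_schema(dataset, records)
        self._trace("register_source", "bootstrapped", dataset)

    def integrate(self, mappings: dict[str, str]) -> None:
        """Link local attributes to concepts of the integrated graph."""
        known = {o for graph in self.local_graphs.values()
                 for (_, p, o) in graph if p == "ex:hasAttribute"}
        for local_attr, concept in mappings.items():
            if local_attr not in known:
                raise KeyError(f"unknown local attribute: {local_attr}")
            self.integrated_graph.add((concept, "ex:mappedFrom", local_attr))
            self._trace("integration", "mapped", f"{local_attr}->{concept}")

    def query(self, concept: str) -> list:
        """Answer a query over the integrated graph by following mappings."""
        values = []
        for (subject, _, local_attr) in sorted(self.integrated_graph):
            if subject == concept:
                dataset, key = local_attr.split(".", 1)
                values += [r[key] for r in self.data[dataset] if key in r]
        self._trace("exploration", "queried", concept)
        return values


layer = GovernanceLayer()
layer.register_source("crm", [{"client": "Ada", "city": "Girona"}])
layer.register_source("sales", [{"customer": "Bob", "amount": 10}])
layer.integrate({"crm.client": "ex:Customer", "sales.customer": "ex:Customer"})
print(layer.query("ex:Customer"))  # ['Ada', 'Bob']
print(len(layer.traceability))     # 5 annotated actions
```

In this sketch the user only names a concept of the integrated graph (ex:Customer) and never touches the two sources directly, which loosely mirrors the data independence principle mentioned above. Every call also appends an entry to the traceability list, so the metadata needed for later meta-analysis accumulates as a side effect of normal use.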

Acknowledgments

This work is supported by the Horizon Europe Programme under GA. 101135513 (CYCLOPS) and GA. 101093164 (ExtremeXP) and the Spanish Ministerio de Ciencia e Innovación under project PID2020-117191RB-I00/AEI/10.13039/501100011033 (DOGO4ML).

References

[1] T. D. Bie, L. D. Raedt, J. Hernández-Orallo, H. H. Hoos, P. Smyth, C. K. I. Williams, Automating data science, Commun. ACM 65 (2022) 76-87.

[2] P. Weill, J. Ross, IT Governance: How Top Performers Manage IT Decision Rights for Superior Results, 2004.

[3] S. Nadal, P. Jovanovic, B. Bilalli, O. Romero, Operationalizing and automating data governance, J. Big Data 9 (2022) 117.