<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>There is no Data Science without Data Governance: a Proposal Based on Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Besim Bilalli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petar Jovanovic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergi Nadal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Queralt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Romero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Politècnica de Catalunya</institution>
          ,
          <addr-line>UPC-BarcelonaTech</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data Science and data-driven Artificial Intelligence are here to stay, and they are expected to further transform the current global economy. From a technical point of view, there is overall agreement that data-based disciplines require combining data engineering and data analysis skills, but data engineering is nowadays trailing behind and still catching up with the rapid changes in the data analysis landscape. To unleash the real power of data, data-centric systems must be professionalized, i.e., operationalized and systematized, so that repetitive, time-consuming and error-prone tasks are automated. To this end, we present our vision of next-generation data governance for data-centric systems based on knowledge graphs. We claim that without the knowledge embedded in the data governance layer, Data Science will not unleash its potential.</p>
      </abstract>
      <kwd-group>
        <kwd>data lifecycle</kwd>
        <kwd>data management</kwd>
        <kwd>data analytics</kwd>
        <kwd>data governance</kwd>
        <kwd>data science</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. A Data-Centric System</title>
      <p>We are nowadays witnessing the rise of the so-called data-driven economy, where data is an organizational asset from which to extract objective evidence and gain competitiveness. However, the promises related to data and its transformative potential remain far from realized.<fn><p>DOLAP’24: 26th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data. Contact: besim.bilalli@upc.edu (B. Bilalli); petar.jovanovic@upc.edu (P. Jovanovic); sergi.nadal@upc.edu (S. Nadal); anna.queralt@upc.edu (A. Queralt); oscar.romero@upc.edu (O. Romero). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p></fn></p>
      <p>
        First, collecting, organizing and managing large data repositories is hard. Concepts such as data lakes, data fabric, data mesh or DataOps, among many others, have arisen to help systematize and operationalize data management. Yet, current solutions require a huge manual burden and there are still no reference architectures (such as Data Warehousing for Business Intelligence, which is however not suitable for the problems framed by Data Science) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Thus, organizations tend to work with different data silos, which are fragmented views of their own data that, in many cases, they are not able to cross. As a result, most data analyses conducted nowadays are based on whatever data happen to be available, which are neither properly contextualized nor contain all the potentially relevant variables in the organization.
      </p>
      <p>
        The main reason behind all these problems is the lack of governance of the whole data lifecycle. Data governance may be defined as specifying what decisions must be made to ensure effective data management and data usage, and who makes them [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We identify the four main aspects required to govern the complete data lifecycle [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]: the data principles that establish the link between the data assets and the business, and machine-readable metadata that describe not only the data assets but also how to access and manipulate the data. Metadata describing the complete data lifecycle within the system are mandatory (e.g., the datasets used in a specific analysis, the transformations and data preparation performed, the algorithm chosen, and model training information). Finally, a transversal but equally relevant aspect is data quality, which includes the qualitative description of the data assets. Importantly, the metadata describing the data lifecycle must include the transformations conducted to guarantee data quality.
      </p>
      <p>In short, data governance calls for a systematic organization and annotation of data assets. Yet, current works focus either on how to organize data assets (i.e., data management) or on how to annotate them with metadata (data enrichment); there are no end-to-end data governance proposals covering the whole data lifecycle. Figure 1 presents the ambitious architectural framework we propose to make data governance a reality.</p>
      <p>Our vision is grounded on four main subsystems: (i) the data management subsystem, which stores and manages the data assets; (ii) the data analysis subsystem, where the analytics take place; (iii) the data governance subsystem, where all the decisions, transformations and actions made at any step of the data lifecycle are annotated in a machine-readable format using knowledge graphs; and (iv) the exploitation subsystem, where a set of modules that interface with the data governance subsystem embed usual actions (e.g., creating artifacts in the data management and/or analysis subsystems). As such, this architecture mimics that of a database system and, ideally, user interactions should always be conducted via the exploitation layer to guarantee that whatever action is taken is properly annotated in the data governance subsystem (portraying the data independence principle).</p>
      <p>Relevantly, the data management subsystem follows good practices and distributes the data assets (from raw data to other levels of transformation) into zones to separate concerns and facilitate maintenance and evolution. A dataset is registered into the system via the register source module. Registering a dataset automatically triggers several tasks that generate (i) a graph-based representation of its schemata (also known as bootstrapping) and (ii) mappings (via the data discovery module) to (iii) a formatted representation of such data according to the chosen canonical data model (e.g., key-value). The integration module consolidates a set of datasets into a single integrated graph, which represents the system's integrated schema. Relevantly, mappings between the integrated and local graphs allow querying the system via the integrated graph for exploration purposes. The integrated graph is the core metadata artifact through which the users will interact with the system. For example, data quality actions are conducted on top of the integrated graph (and propagated to the sources) via the data curation module, whose data assets are stored in the trusted zone. The day-by-day vocabulary, linked to the integrated graph, allows the users to express their needs in terms of their known vocabulary. Accordingly, end-users may express an analytical intent on top of the integrated graph via the intent-based specification module. This module leverages the analytical dataflow generation module, which first materializes an integrated dataset in the exploitation zone and then, from it, generates the required data analysis workflow according to the intents expressed. Finally, all decisions made during the execution of any of the modules mentioned are properly annotated in the traceability graph.</p>
      <p>The core of this architecture is the layered knowledge graph created for data governance, which will enable the development of next-generation data-centric systems that provide several benefits, especially on the data analysis end, and smooth current difficulties in data-centric projects. In short, we claim that rigorous data governance: (i) facilitates systematizing and operationalizing data-centric projects, where data-related artifacts are organized to facilitate developing, maintaining and evolving complex operations on top of them; and (ii) enables automation of complex processes. Specifically, we target the full automation of repetitive, time-consuming and error-prone tasks both for data management and analysis. Governance brings many benefits in this aspect: (a) the burden of collecting, storing and managing datasets is mostly hidden from the end-user; (b) data analysis can be automated, in simple scenarios, via analytical intents expressed over the integrated graph; and (c) although we acknowledge that some aspects of the data lifecycle (data integration, interpretation of analytical results, etc.) cannot be fully automated, these can be supported (e.g., by ranking alternatives). Finally, governance (iii) generates rich metadata that can be analyzed to conduct meta-analyses about how data is used at any level (collected, stored, transformed, analyzed, etc.) or used to enrich or contextualize data analysis (e.g., to avoid LLM hallucinations).</p>
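      <p>As an illustration of the register source module and the traceability graph described in this section, the following is a minimal sketch, not the authors' implementation. It assumes the governance graph can be approximated by an in-memory set of subject-predicate-object triples; the class GovernanceGraph, the function register_source, the gov: vocabulary and the sample dataset are all hypothetical names introduced here for illustration only.</p>
      <preformat>
```python
# Minimal sketch; every name below is hypothetical and not taken from the paper.
from dataclasses import dataclass, field


@dataclass
class GovernanceGraph:
    """Toy knowledge graph stored as a set of (subject, predicate, object) triples."""
    triples: set = field(default_factory=set)

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def objects(self, subject, predicate):
        """Return, sorted, every object linked to `subject` through `predicate`."""
        return sorted(o for s, p, o in self.triples if s == subject and p == predicate)


def register_source(graph, dataset, columns, actor):
    """Bootstrap a graph-based schema for `dataset` and record the action taken."""
    # Bootstrapping: one node for the dataset and one node per attribute.
    graph.add(dataset, "rdf:type", "gov:Dataset")
    for column in columns:
        attribute = f"{dataset}/{column}"
        graph.add(dataset, "gov:hasAttribute", attribute)
        graph.add(attribute, "rdf:type", "gov:Attribute")
    # Traceability: annotate which action was taken, on what, and by whom.
    action = f"action/register/{dataset}"
    graph.add(action, "rdf:type", "gov:RegisterSource")
    graph.add(action, "gov:appliedTo", dataset)
    graph.add(action, "gov:performedBy", actor)
    return action


graph = GovernanceGraph()
action = register_source(graph, "sales.csv", ["customer_id", "amount"], "analyst-1")
print(graph.objects("sales.csv", "gov:hasAttribute"))
print(graph.objects(action, "gov:performedBy"))
```
      </preformat>
      <p>The sketch prints the bootstrapped attributes of the sample dataset and the actor recorded for its registration. Keeping the schema and the trace in the same graph is only one possible design; a layered graph, as the architecture suggests, would presumably keep them in separate but linked layers.</p>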
      <p><bold>Acknowledgments.</bold> This work is supported by the Horizon Europe Programme under GA.101135513 (CYCLOPS) and GA.101093164 (ExtremeXP) and the Spanish Ministerio de Ciencia e Innovación under project PID2020-117191RB-I00 / AEI/10.13039/501100011033 (DOGO4ML).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>De Bie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>De Raedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hernández-Orallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Smyth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K. I.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>Automating data science</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>65</volume>
          (
          <year>2022</year>
          )
          <fpage>76</fpage>
          -
          <lpage>87</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Weill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <article-title>IT Governance: How Top Performers Manage IT Decision Rights for Superior Results</article-title>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nadal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jovanovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bilalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <article-title>Operationalizing and automating data governance</article-title>
          ,
          <source>J. Big Data</source>
          <volume>9</volume>
          (
          <year>2022</year>
          )
          <fpage>117</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>