=Paper= {{Paper |id=Vol-2840/paper5 |storemode=property |title=Coining goldMEDAL: A New Contribution to Data Lake Generic Metadata Modeling |pdfUrl=https://ceur-ws.org/Vol-2840/paper5.pdf |volume=Vol-2840 |authors=Etienne Scholly,Pegdwende N. Sawadogo,Pengfei Liu,Javier A. Espinosa-Oviedo,Cecile Favre,Sabine Loudcher,Jerome Darmont,Camille Nous |dblpUrl=https://dblp.org/rec/conf/dolap/SchollySLEFLDN21 }} ==Coining goldMEDAL: A New Contribution to Data Lake Generic Metadata Modeling== https://ceur-ws.org/Vol-2840/paper5.pdf
                       Coining goldMEDAL: A New Contribution to
                         Data Lake Generic Metadata Modeling
                Étienne Scholly                               Pegdwendé N. Sawadogo                                  Pengfei Liu
          Université de Lyon, Lyon 2,                         Université de Lyon, Lyon 2,                   Université de Lyon, Lyon 2,
              UR ERIC & BIAL-X                                         UR ERIC                                        UR ERIC
                 Lyon, France                                       Lyon, France                                    Lyon, France
          etienne.scholly@bial-x.com                      pegdwende.sawadogo@univ-lyon2.fr                 pengfei.liu@eric.univ-lyon2.fr

       Javier A. Espinosa-Oviedo                                         Cécile Favre                            Sabine Loudcher
          Université de Lyon, Lyon 2,                           Université de Lyon, Lyon 2,                  Université de Lyon, Lyon 2,
            UR ERIC-LAFMIA lab                                            UR ERIC                                     UR ERIC
                 Lyon, France                                           Lyon, France                                Lyon, France
           javier.espinosa@imag.fr                              cecile.favre@univ-lyon2.fr                 sabine.loudcher@univ-lyon2.fr

                                          Jérôme Darmont                                    Camille Noûs
                                    Université de Lyon, Lyon 2,                        Université de Lyon, Lyon 2,
                                             UR ERIC                                     Laboratoire Cogitamus
                                          Lyon, France                                        Lyon, France
                                  jerome.darmont@univ-lyon2.fr                         camille.nous@cogitamus.fr

ABSTRACT                                                                           hardly reusable. Thus, other researchers proposed more theoreti-
The rise of big data has revolutionized data exploitation prac-                    cal approaches named metadata models. Such approaches aim
tices and led to the emergence of new concepts. Among them,                        to provide detailed guidelines to metadata system design, while
data lakes have emerged as large heterogeneous data reposi-                        being generic, i.e., flexible and adaptable to many use cases. Yet,
tories that can be analyzed by various methods. An efficient                       data lake generic metadata modeling is still an open research
data lake requires a metadata system that addresses the many                       issue. A feature-based assessment indeed shows that none of the
problems arising when dealing with big data. In consequence,                       existing metadata models is generic enough, including our own
the study of data lake metadata models is currently an active                      MEtadata model for DAta Lakes (MEDAL) [20].
research topic and many proposals have been made in this re-                          To address this genericity issue, we introduce goldMEDAL, a
gard. However, existing metadata models are either tailored for                    revision of our MEDAL model. We define goldMEDAL through
a specific use case or insufficiently generic to manage different                  a classical three-level modeling process (i.e., conceptual, logical
types of data lakes, including our previous model MEDAL. In                        and physical). We choose a formal representation to avoid ambi-
this paper, we generalize MEDAL’s concepts in a new metadata                       guity but also provide a UML representation for readability. The
model called goldMEDAL. Moreover, we compare goldMEDAL                             logical level is a translation of the concepts using graph theory.
with the most recent state-of-the-art metadata models aiming                       Finally, we describe three different physical models as proofs
at genericity and show that we can reproduce these metadata                        of concept. Furthermore, to highlight goldMEDAL’s genericity,
models with goldMEDAL’s concepts. As a proof of concept, we                        we show that the concepts of our metadata model help model
also illustrate that goldMEDAL allows the design of various data                   state-of-the-art metadata models from the literature.
lakes by presenting three different use cases.                                        The remainder of this paper is organised as follows. Section 2
                                                                                   reviews and discusses existing data lake metadata models. Sec-
                                                                                   tion 3 presents goldMEDAL’s conceptual and logical models.
1    INTRODUCTION                                                                  Section 4 illustrates how goldMEDAL generalises other data lake
                                                                                   metadata models and how it can be used to implement different
While the big data revolution has shaken up the entire field of data               data lakes. Finally, Section 5 concludes this paper and hints at
management and analytics, new concepts have emerged to meet                        future research.
these new challenges. Data lakes belong to such new concepts.
First introduced by James Dixon, a data lake is a vast repository of
raw and heterogeneous data from which various analyses can be
performed [4]. Data lakes quickly gained popularity and several                    2    RELATED WORKS
teams started to address research issues [13, 15]. A key one is                    Metadata management plays a vital role in data lakes. Indeed,
efficient metadata management to prevent data lakes from turning                   in the absence of a fixed schema, data querying and analyses
into unexploitable data swamps [10, 11, 16, 19, 22].                               depend on an efficient metadata system. Several approaches help
   However, most metadata management proposals in the liter-                       manage metadata in data lakes. However, only a few of them
ature [1, 8, 14], and their associated implementations, give few                   provide enough detail to ensure reusability. We refer to them
details on the way data are conceptually organized and are hence
                                                                                   metadata models (Section 2.1) and compare them with respect to
Copyright ©2021 for this paper by its authors. Use permitted under Creative Com-   genericity (Section 2.2).
mons License Attribution 4.0 International (CC BY 4.0).
2.1    Metadata Models for Data Lakes                                     To the best of our knowledge, there exist two feature-based
GEMMS (Generic and Extensible Metadata Management System)              comparisons of data lake metadata models in the literature. We
is a pioneer generic metadata model for data lakes [17]. GEMMS         introduced six relevant features: semantic enrichment, data in-
features two abstract entities: data file and data unit. A data        dexing, data polymorphism, data versioning, link generation and
file represents a generic data source. A data unit represents an       usage tracking [20]; while Eichler et al. identified three other
identifiable data element inside a data source. Each data file is      features: metadata properties, zone metadata and the support of
composed of a set of data units (e.g., a spreadsheet file is com-      multiple granularity levels [5].
posed of a set of sheets). Data files and data units can be enriched      Considering that both the above sets of features are relevant,
with atomic or complex metadata values. However, GEMMS requires        we propose to combine them for comparing the genericity of
information on data structure to operate, which makes it               metadata models. Beyond simply unioning features, we merge
unsuitable for working with unstructured data.                         data polymorphism with zone metadata, as these features both
    Ground is another generic metadata model [9] that can be used      refer to the same concept. We also split link generation in two
for modeling metadata in data lakes (although not specifically         new features, namely similarity links and categorization, because
designed for that). Ground tracks data context (metadata) at three     some metadata models support only one of them. Finally, we
levels: 1) metadata properties, 2) data usage history and 3) data      omit data indexing in this comparison, considering that indexing
versioning. Although more extensive than GEMMS, Ground (as             does not actually induce metadata modeling issues. Although
well as GEMMS) does not handle data linkage even though                indexing is definitely relevant to assess metadata systems [20],
this type of metadata has been identified as relevant in data          this feature seems less suited to metadata models.
lakes [6, 20].                                                            All in all, we obtain a list of eight features that can serve to
    Based on GEMMS’ data file and data unit concepts, the model        compare data lake metadata models and evaluate their genericity.
of Diamantini et al. adds similarity links between data units to in-       (1) Semantic enrichment
directly link data files [3]. However, their model does not include        (2) Data polymorphism/multiple zones
important metadata such as data versioning and usage tracking              (3) Data versioning
as compared to Ground.                                                     (4) Usage tracking
    Similar to Diamantini et al., Ravat and Zhao propose a model           (5) Categorization
where each data file can be associated with atomic and complex             (6) Similarity links
metadata [18], including metadata properties, data history and             (7) Metadata properties
links with other data files. The main contribution of this model           (8) Multiple granularity levels
is the notion of zone metadata. Many data lake architectures              Table 1 highlights the features supported by all the models
consider the existence of zones (e.g., raw data zone, processed        reviewed in Section 2.1. It shows that none of them support all
data zone) [7, 18]. Zone metadata specifies the zones where data is    the features we identify.
located. However, Ravat and Zhao’s model cannot simultaneously
represent different data granularity levels as previous models         3     GOLDMEDAL METADATA MODEL
do [3, 17].
                                                                       Section 2.2 establishes that none of the reviewed data lake
    MEDAL represents data through three main concepts: data
                                                                       metadata models supports all eight comparison features. In this
objects, representations and versions [20]. Data objects correspond
                                                                       section, we thoroughly describe goldMEDAL, a substantial evolution
to GEMMS’ data files. Representations correspond to the result
                                                                       of MEDAL that generalizes its concepts while addressing all the
of transformed objects. Versions represent object updates. Both
                                                                       features identified in Section 2.2.
representations and versions are materialized in the data lake.
                                                                          A metadata model can be expressed “in the form of an explicit
Thus, MEDAL gives alternative ways to track data linkage and
                                                                       schema, a formal definition, or a textual description” [5]. In this
zone metadata through the concepts of versions and represen-
                                                                       paper, we choose a formal approach for the sake of precision.
tations, respectively. MEDAL also supports linkage metadata
                                                                       Yet, for the sake of readability and communication with possibly
through categorizations and similarity links. However, MEDAL
                                                                       non-computer scientists, we also provide a semi-formal UML
does not support multiple data granularity levels either.
                                                                       model. Moreover, we use a conventional data modeling approach
    Finally, HANDLE (Handling metAdata maNagement in Data
                                                                       that leverages a conceptual, a logical and a physical model, to
LakEs) uses the generic concept of data entity to represent both
                                                                       demonstrate the actual implementation process of our metadata
data files and parts of data files, which helps HANDLE support
                                                                       model.
any granularity level [5]. In HANDLE, each data entity is as-
                                                                          Section 3.1 presents goldMEDAL’s formal and semi-formal con-
sociated with tags that represent zones, granularity levels or
                                                                       ceptual models. Section 3.2 details the translation of goldMEDAL’s
categorizations. HANDLE can also connect data entities together
                                                                       concepts into a logical, graph-based model. For the sake of clarity,
through containment links (e.g., between a table and a tuple).
                                                                       we use the same examples in both sections, i.e., the examples
HANDLE provides concepts that subsume most of the concepts
                                                                       given at the conceptual level are translated at the logical level.
of the previous metadata models.
                                                                       Finally, example physical models, i.e., metadata models actually
                                                                       implemented in data lakes with goldMEDAL, are presented
                                                                       in Section 4.2.
2.2    Genericity of Metadata Models
A generic metadata model should adapt to any data lake use case.       3.1    Conceptual Model
As each use case requires specific metadata management features,
                                                                       In MEDAL, data items were considered either as raw data, or as
we consider that the more features a metadata model
                                                                       versions or representations derived from raw data. The concepts
supports, the more generic it is. Therefore, features are a suitable
                                                                       of version and representation were used to express updated and
way to compare metadata models.
                                                                       transformed data, respectively. While modeling metadata for
                                       Table 1: Features supported by data lake metadata models

          Features ↓ \ Models →         GEMMS        Ground     Diamantini et al.        Ravat & Zhao       MEDAL         HANDLE        goldMEDAL
              Semantic enrichment           ✓           ✓                 ✓                     ✓               ✓            ✓               ✓
    Polymorphism/multiple zones                                           ✓                     ✓               ✓            ✓               ✓
                   Data versioning                      ✓                                       ✓               ✓                            ✓
                     Usage tracking                     ✓                                       ✓               ✓            ✓               ✓
                     Categorization         ✓           ✓                                       ✓               ✓            ✓               ✓
                    Similarity links                                      ✓                     ✓               ✓            ✓               ✓
               Metadata properties          ✓           ✓                                       ✓               ✓            ✓               ✓
        Multiple granularity levels         ✓                             ✓                                                  ✓               ✓
                               Total       4/8         5/8                4/8                  7/8             7/8          7/8             8/8
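
As an illustration, the totals of Table 1 can be recomputed from its check marks. The following Python sketch transcribes the table as we read it; the identifiers (FEATURES, SUPPORT, score, missing) are ours and purely illustrative, and they are not part of any of the compared models.

```python
# Feature matrix transcribed from Table 1 (identifiers are illustrative).
FEATURES = {
    "semantic enrichment",
    "polymorphism/multiple zones",
    "data versioning",
    "usage tracking",
    "categorization",
    "similarity links",
    "metadata properties",
    "multiple granularity levels",
}

# For each model, the set of features marked as supported in Table 1.
SUPPORT = {
    "GEMMS": {"semantic enrichment", "categorization",
              "metadata properties", "multiple granularity levels"},
    "Ground": {"semantic enrichment", "data versioning", "usage tracking",
               "categorization", "metadata properties"},
    "Diamantini et al.": {"semantic enrichment", "polymorphism/multiple zones",
                          "similarity links", "multiple granularity levels"},
    "Ravat & Zhao": FEATURES - {"multiple granularity levels"},
    "MEDAL": FEATURES - {"multiple granularity levels"},
    "HANDLE": FEATURES - {"data versioning"},
    "goldMEDAL": set(FEATURES),
}


def score(model: str) -> int:
    """Number of supported features (the 'Total' row of Table 1)."""
    return len(SUPPORT[model] & FEATURES)


def missing(model: str) -> list:
    """Features of the comparison that the model does not support."""
    return sorted(FEATURES - SUPPORT[model])
```

Under this reading, missing("HANDLE") lists data versioning only, and goldMEDAL is the only model whose score equals the number of features.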


various data lakes, we found that more data items were possible,                similarity links between data entities or hierarchies between
e.g., temporal representations. Thus, we decided to generalize                  groups. For example, a temporal hierarchy month → quarter
any such concepts into a global concept named data entity in                    would have the months of January, February and March linked
goldMEDAL.                                                                      to the first quarter of a given year.
   Accordingly, we also generalized in goldMEDAL:                                  Definition 3.4. The set of links is denoted L = {𝑙𝑚 }𝑚 ∈N∗ , with
      • update and transformation operations that served to track               either:
        the lineage of versions and representations, respectively,                  • 𝑙𝑚 : E → E,
        as well as parenthood relationships that express fusion                     • 𝑙𝑚 : 𝐺 𝑗 → 𝐺 𝑗 ′ and 𝑗 ≠ 𝑗 ′ .
        operations, into the concept of process;
      • similarity links into the global concept of link.                           Example 3.5. Let us elaborate the sample hierarchy month
                                                                                → quarter. Let $G_3 = \{Jan, Feb, \ldots, Dec\}$ be a grouping of data
   Finally, we retained in goldMEDAL the MEDAL concept of
                                                                                entities per month and $G_4 = \{Q1, Q2, Q3, Q4\}$ be a grouping of
grouping, which notably allows multiple data granularity levels.
                                                                                quarters in a year. Now, let us make explicit some data entities
   All the main goldMEDAL concepts (data entity, grouping, link
                                                                                and their groups: $Jan = \{e_1, e_2\}$, $Feb = \{e_3\}$, $Mar = \{e_4\}$;
and process) are characterized by attributes or properties that
                                                                                $Q1 = \{e_1, e_2, e_3, e_4\}$. Link $l_1$ materializes the hierarchical link
constitute their internal metadata.
                                                                                between groups $G_3$ and $G_4$: $Jan \xrightarrow{l_1} Q1$,
                                                                                $Feb \xrightarrow{l_1} Q1$, $Mar \xrightarrow{l_1} Q1$.
   3.1.1 Data Entity. Data entities are the basic units of our
                                                                                Inversely, $Q1 \xrightarrow{l_1^{-1}} \{Jan, Feb, Mar\}$.
metadata model. They are flexible in terms of data granularity. For
example, a data entity can represent a spreadsheet file, a textual                A functional notation may also be used: $l_1(Jan) = Q1$, $l_1(Feb) = Q1$,
or semi-structured document, an image, a database table, a tuple                $l_1(Mar) = Q1$, $l_1^{-1}(Q1) = \{Jan, Feb, Mar\}$. Also note that
or an entire database. The introduction of any new element in                   $Q1 = Jan \cup Feb \cup Mar$.
the data lake leads to the creation of a new data entity.                          3.1.4 Process. A process refers to any transformation applied
   Definition 3.1. The set of data entities is denoted E = {𝑒𝑖 }𝑖 ∈N∗ .         to a set of data entities that produces a new set of data entities.
   3.1.2 Grouping. A grouping is a set of groups; a group brings                   Definition 3.6. The set of processes is denoted P = {𝑃𝑛 }𝑛 ∈N∗ ,
together data entities based on common properties. For example,                 with 𝑃𝑛 = {𝐼𝑛 , 𝑂𝑛 }, 𝐼𝑛 ⊆ E the set of input data entities of 𝑃𝑛
the raw and preprocessed data zones common in data lake archi-                  and 𝑂𝑛 the set of output data entities that is integrated into E
tectures are the groups of a zone grouping. Another example is                  (E ← E ∪ 𝑂𝑛 ).
a grouping of textual documents according to the language of                       Example 3.7. Process 𝑃1 splits a set of textual documents 𝐷 ⊆
writing.                                                                        E into a set of text fragments 𝐹 ⊆ E. Here, 𝐼 1 = 𝐷 and 𝑂 1 = 𝐹 .
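
Definition 3.6 and Example 3.7 can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the Process class, the apply_process function and the entity identifiers (d1, d2, f1, ...) are hypothetical and do not belong to goldMEDAL itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """A process P_n described by its input set I_n and output set O_n."""
    name: str
    inputs: frozenset
    outputs: frozenset


def apply_process(entities: set, process: Process) -> set:
    """Return E ∪ O_n, after checking that I_n ⊆ E (Definition 3.6)."""
    if not process.inputs <= entities:
        raise ValueError("every input of a process must be a known data entity")
    return entities | process.outputs


# Example 3.7: P_1 splits textual documents D into text fragments F.
E = {"d1", "d2"}  # hypothetical documents already in the lake
P1 = Process("split", frozenset({"d1", "d2"}), frozenset({"f1", "f2", "f3"}))
E = apply_process(E, P1)  # the fragments are integrated into E
```

Keeping the inputs and outputs as plain sets mirrors the definition; any attributes of the process (author, date, parameters) would be additional fields in such a sketch.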
  Definition 3.2. The set of groupings is denoted G = {𝐺 𝑗 } 𝑗 ∈N∗ ,                3.1.5 UML model. Figure 1 features goldMEDAL’s conceptual
with 𝐺 𝑗 = {Γ𝑗𝑘 }𝑘 ∈N∗ and Γ𝑗𝑘 ⊆ E is a group.                                  model as a UML class diagram. All the concepts of goldMEDAL,
                                                                                including group, are modeled as classes (data entity, grouping,
   Example 3.3. To get back to our previous examples, G =                       group and process) or association classes (entity link and group
{𝐺 1, 𝐺 2 }. 𝐺 1 = {Γ11, Γ12 } is the zone grouping, with Γ11 and               link, which are labeled E-Link and G-Link in Figure 1, respec-
Γ12 being the raw data and processed data zones, respectively.                  tively).
𝐺 2 = {Γ21, Γ22 } is the language grouping, with Γ21 and Γ22 the                    Finally, although they are not depicted in Figure 1, all
groups corresponding to French and English languages, respec-                   classes and association classes bear attributes that model meta-
tively. Note that the groupings 𝐺 𝑗 are deliberately not partitions             data properties. These attributes may be of any type, including
of E. Thus, a bilingual French-English document can belong to                   lists, and of course vary with respect to use cases.
both groups Γ21 and Γ22 .
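
The fact that a grouping need not partition E can be checked on a toy instance. The Python sketch below is a minimal illustration, assuming four made-up data entities (e1 to e4) and our own helper names (is_valid_grouping, is_partition, groups_of); the membership of each entity is invented for the example.

```python
# Data entities E; the identifiers are invented for this example.
E = {"e1", "e2", "e3", "e4"}

# A grouping G_j is a collection of groups, each group being a subset of E.
zone_grouping = {"raw": {"e1", "e3"}, "processed": {"e2", "e4"}}
# e3 plays the role of a bilingual French-English document.
language_grouping = {"french": {"e1", "e2", "e3"}, "english": {"e3", "e4"}}


def is_valid_grouping(grouping: dict, entities: set) -> bool:
    """Every group must be a subset of E (Definition 3.2)."""
    return all(group <= entities for group in grouping.values())


def is_partition(grouping: dict, entities: set) -> bool:
    """True if the groups cover E and are pairwise disjoint."""
    groups = list(grouping.values())
    covered = set().union(*groups)
    disjoint = sum(len(group) for group in groups) == len(covered)
    return covered == entities and disjoint


def groups_of(entity: str, grouping: dict) -> list:
    """Names of the groups that contain the given data entity."""
    return sorted(name for name, group in grouping.items() if entity in group)
```

Here the zone grouping happens to be a partition, whereas the language grouping is not, since e3 belongs to both language groups.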
  3.1.3 Link. Links are used to associate either data entities                  3.2     Logical Model
with each other or groups of data entities with each other. They                As MEDAL and HANDLE did, though at the physical level, we
can be oriented or not. They allow the expression of, e.g., simple              choose to design goldMEDAL’s logical model as a graph, which is
                                                                              hyperedges representing a grouping of data entities per month.
                                                                              $H_4 = \{\theta_{Q1}, \theta_{Q2}, \theta_{Q3}, \theta_{Q4}\}$ is a set of hyperedges representing
                                                                              the grouping of quarters in a year. Let us make this explicit
                                                                              with instances. $\theta_{Jan} = \{n_1, n_2\}$, $\theta_{Feb} = \{n_3\}$, $\theta_{Mar} = \{n_4\}$;
                                                                              $\theta_{Q1} = \{n_1, n_2, n_3, n_4\}$. Edge $a_1$ materializes the hierarchical link between
                                                                              $H_3$ and $H_4$: $\theta_{Jan} \xrightarrow{a_1} \theta_{Q1}$, $\theta_{Feb} \xrightarrow{a_1} \theta_{Q1}$, $\theta_{Mar} \xrightarrow{a_1} \theta_{Q1}$.
                                                                              Inversely, $\theta_{Q1} \xrightarrow{a_1^{-1}} \{\theta_{Jan}, \theta_{Feb}, \theta_{Mar}\}$.
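
The month → quarter hierarchy of this example can be sketched in Python with hyperedges stored as sets of nodes and the oriented edge stored as a mapping between hyperedge names. The representation and the helper names (inverse, is_consistent) are our own assumptions for illustration, not a prescribed implementation of goldMEDAL.

```python
# Hyperedges (groups) as sets of nodes; node names follow the example.
theta = {
    "Jan": frozenset({"n1", "n2"}),
    "Feb": frozenset({"n3"}),
    "Mar": frozenset({"n4"}),
    "Q1": frozenset({"n1", "n2", "n3", "n4"}),
}

# Oriented edge a_1 between hyperedges: each month points to its quarter.
a1 = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1"}


def inverse(link: dict) -> dict:
    """Inverse relation: each target is mapped to the set of its sources."""
    inv = {}
    for source, target in link.items():
        inv.setdefault(target, set()).add(source)
    return inv


def is_consistent(link: dict, groups: dict) -> bool:
    """Check that each target group is the union of its source groups.

    This holds in the example (Q1 = Jan ∪ Feb ∪ Mar); it is a property of
    this hierarchy, not a constraint stated by the model for every link.
    """
    return all(
        groups[target] == frozenset().union(*(groups[s] for s in sources))
        for target, sources in inverse(link).items()
    )
```

With these definitions, inverse(a1) maps Q1 back to the three months, and is_consistent(a1, theta) confirms that the nodes of the quarter are exactly those of its months.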

                                                                                3.2.4 Translation of Process. A process is modeled by an ori-
                                                                              ented hyperedge.
                                                                                 Definition 3.15. The set of oriented hyperedges modeling pro-
                                                                              cesses is denoted Q = {Π𝑛 }𝑛 ∈N∗ , with Π𝑛 = {Υ𝑛 , Ω𝑛 }, Υ𝑛 ⊆ N
                                                                              being the set of input nodes of Π𝑛 and Ω𝑛 the set of output
        Figure 1: UML class diagram of goldMEDAL                              nodes integrated to N (N ← N ∪ Ω𝑛 ). Any Π𝑛 carries attributes.
                                                                                 Example 3.16. Π 1 = {Υ1, Ω1 } is an oriented hyperedge rep-
particularly well-suited to depict relationships between different            resenting the process of splitting a set of textual documents
concepts.                                                                     (Example 3.7) represented by the set of nodes 𝑁𝐷 ⊆ N , into a
  Thus, in this section, we translate the concepts defined in                 set of text fragments represented by the set of nodes 𝑁 𝐹 ⊆ N .
Section 3.1 into graph nodes, edges and hyperedges, using the                 Then, Υ1 = 𝑁𝐷 and Ω1 = 𝑁 𝐹 .
same indices, e.g., 𝑖, 𝑗, 𝑘... Moreover, we illustrate the translation
                                                                                   3.2.5 Sample Graph Representation. Figure 2 provides a sche-
with the examples used at the conceptual level. Finally, we also
                                                                              matic representation of the examples above. Let us introduce
propose a graphic illustration of goldMEDAL’s logical model.
                                                                              eight data entity nodes {𝑛𝑖 }𝑖 ∈ [1,8] colored in orange.
  3.2.1 Translation of Data Entity. Data entities are modeled by                   Example 3.12 is depicted on the left-hand side of Figure 2.
nodes that carry attributes.                                                  Groups of 𝐻 1 are colored in purple, while 𝐻 2 ’s are blue. We can
  Definition 3.8. The set of nodes is denoted N = {𝑛𝑖 }𝑖 ∈N∗ . Each           see that 𝑛 1 and 𝑛 3 belong to the raw data group 𝜃 11 , while 𝑛 2 and
node 𝑛𝑖 ∈ N carries attributes.                                               𝑛 4 are in the processed data group 𝜃 12 . Moreover, 𝑛 1 , 𝑛 2 and 𝑛 3
                                                                              are in the French language group 𝜃 21 , and 𝑛 4 is in the English
   Example 3.9. A PDF file stored in the data lake can be repre-              language group 𝜃 22 .
sented by a node 𝑛 1 .                                                             Example 3.14 is represented at the center of Figure 2. Groups
  3.2.2 Translation of Grouping. A group is represented by a                  of 𝐻 3 , namely 𝜃 𝐽 𝑎𝑛 , ..., 𝜃 𝐷𝑒𝑐 are colored in green and groups of 𝐻 4
non-oriented hyperedge, i.e., an edge that can link more than                 (𝜃𝑄1 , ..., 𝜃𝑄4 ) are colored in grey. Edge 𝑎 1 connects groups
two nodes. A grouping is modeled by a set of hyperedges.                      of 𝐻 3 to those of 𝐻 4 .
                                                                                   Finally, Example 3.16 is depicted on the right-hand side of
  Definition 3.10. A hyperedge (a group) is denoted 𝜃 𝑗𝑘 ⊆ N ,                Figure 2. 𝑛 5 is a textual document split in fragments 𝑛 6 , 𝑛 7 and
with 𝑗, 𝑘 ∈ N∗ . Any 𝜃 𝑗𝑘 carries attributes.                                 𝑛 8 . Π 1 ’s input and output Υ1 and Ω1 , respectively, are colored in
   Definition 3.11. The set of hyperedges of grouping 𝑗 is denoted            yellow.
𝐻 𝑗 = {𝜃 𝑗𝑘 } and carries attributes. The set of hyperedge sets (set
of groupings) is denoted H .                                                  4     GOLDMEDAL ASSESSMENT
                                                                              In this section, we discuss goldMEDAL’s genericity. To this end,
   Example 3.12. Let us translate Example 3.3. H = {𝐻 1, 𝐻 2 }.
                                                                              we show in Section 4.1 that all three most complete metadata
𝐻 1 = {𝜃 11, 𝜃 12 } is the set of hyperedges representing the zone
                                                                              models can be modeled with goldMEDAL. In Section 4.2, we
grouping, with 𝜃 11 and 𝜃 12 the hyperedges representing the raw
                                                                              present our ongoing implementation work of goldMEDAL on
data and processed data zones, respectively. 𝐻 2 = {𝜃 21, 𝜃 22 } is the
                                                                              distinct use cases.
set of hyperedges representing the language grouping, with 𝜃 21
and 𝜃 22 the hyperedges representing the groups corresponding
to French and English languages, respectively.                                4.1     Comparison of State-of-the-Art Metadata
                                                                                      Models with goldMEDAL
   3.2.3 Translation of Link. Links may model relationships be-
tween either data entities (nodes) or groups (hyperedges). They               To evaluate goldMEDAL’s genericity, we compare it with the
are modeled by edges.                                                         three metadata models that are both the most recent and the
                                                                              most complete among metadata models, i.e., MEDAL, Ravat and
   Definition 3.13. The set of edges is denoted $A = \{a_m\}_{m \in \mathbb{N}^*}$,   Zhao’s and HANDLE (Section 2.2).
with any $a_m$ being either:                                                     For each comparison, we use a two-column table. The first
    • an edge, oriented or not, connecting two nodes. Then,                   column lists goldMEDAL’s concepts, and the second column
       $a_m = (n_i, n_{i'}) \in N^2$;                                         the corresponding concepts of the compared model. When any
    • an oriented edge connecting two hyperedges. Then,                       concept does not have an equivalent, it is marked with “—”.
       $a_m = (\theta_{jk}, \theta_{j'k'}) \in H_j \times H_{j'}$.
                                                                                4.1.1 MEDAL vs. goldMEDAL. goldMEDAL’s four main con-
In both cases, the edge carries attributes.                                   cepts help generalize all of MEDAL’s concepts (Table 2). Data
  Example 3.14. To get back to the sample hierarchy month →                   entity generalizes the concepts of version and representation.
quarters from Example 3.5, $H_3 = \{\theta_{Jan}, \theta_{Feb}, \ldots, \theta_{Dec}\}$ is a set of   Grouping generalizes the concepts of object and grouping (in
                                          Figure 2: Sample goldMEDAL graph logical model


the sense of MEDAL). Link generalizes the concept of similarity     though they could be classified as global metadata. Users and
link. Finally, process generalizes transformation, update and       accesses can indeed be modeled as data entities and processes,
parenthood relationship.                                            respectively.
                                                                       4.1.3 HANDLE vs. goldMEDAL. goldMEDAL can also gener-
        Table 2: goldMEDAL and MEDAL concepts
                                                                    alize HANDLE’s concepts (Table 4). Data entity generalizes both
                                                                    data and metadata, since a data entity is a representation of data
           goldMEDAL         MEDAL                                  that also contains metadata properties. Grouping generalizes
           Data entity       Version, Representation                three concepts: Categorization, ZoneIndicator, and Granulari-
                                                                    tyIndicator. Finally, process has no direct match in HANDLE,
           Grouping          Object, Grouping                       although its authors show processes can be modeled through
           Link              Similarity link                        Action metadata instances of HANDLE’s categorization exten-
                                                                    sion [5].
           Process           Update, Transformation,
                             Parenthood relationship
                                                                           Table 4: goldMEDAL and HANDLE concepts

   Note that we do not mention in this comparison global meta-
                                                                             goldMEDAL        HANDLE
data existing in MEDAL. We indeed consider that elements such
as logs or indexes mostly induce implementation rather than                  Data entity      Data, Metadata
metadata modeling issues.                                                    Grouping         Categorization, ZoneIndicator,
   Yet, other forms of global metadata, namely semantic resources                             GranularityIndicator
such as thesauruses and ontologies, can definitely be modeled
with goldMEDAL using the node, grouping and link concepts.                   Link             Link

   4.1.2 Ravat and Zhao’s Metadata Model vs. goldMEDAL. gold-                Process          —
MEDAL can handle nearly all concepts of Ravat and Zhao’s meta-
data model [18] (Table 3). Data entity generalizes the concept         Handling multiple granularity levels as in HANDLE was not
of dataset and all its subclasses, such as Datalake_Datasets or     supported by MEDAL, so it was a design objective for goldMEDAL.
Source_Datasets. Grouping generalizes the concept of keyword.       Although there is no explicit granularity indicator in goldMEDAL,
Finally, link and process directly correspond to relationship and   any data entity could have a granularity property. However, there
process, respectively.                                              is a more efficient way: defining data entities at the finest possi-
                                                                    ble granularity level. Then, coarser granularity levels are obtained
     Table 3: goldMEDAL and Ravat & Zhao concepts                   with groupings. For example, if each data entity corresponds to
                                                                    a tuple in a relational database, then a grouping represents a set
               goldMEDAL        Ravat & Zhao                        of tables.

               Data entity      Dataset, Subclass                   4.2    goldMEDAL Physical Models
               Grouping         Keyword                             To show that goldMEDAL can model different business issues
               Link             Relationship                        and manage various functionalities while remaining as simple as
                                                                    possible, we apply our metadata model to three different use cases.
               Process          Process                             We also exemplify how goldMEDAL’s logical model (Section 3.2)
               —                User, Access                        can be translated into different physical models.
                                                                       4.2.1 Public Housing Data Lake. For social landlords (agents
  However, two concepts of Ravat and Zhao’s metadata model,         or agencies providing social housing), the use of data is noth-
namely user and access, have no explicit equivalent in goldMEDAL,   ing new, whether through business intelligence for patrimony
management or with data science methods for non-payment fore-           figure, a data entity node is highlighted: some of its attributes
casting. However, landlords are facing two main problems. On            are depicted at the bottom in grey.
the one hand, their analyses are conducted separately: in different        The left-hand side of Figure 3 gives an example of groupings.
environments, by different individuals and with different tools.        There are three groupings: a zone grouping, a format grouping
This implies that collaborative work on the same data is impossi-       and a granularity grouping. Each grouping has its group nodes,
ble. On the other hand, landlords know how to use their data, but       colored in green, purple and blue, respectively. Data entity nodes
have much more difficulty capturing and exploiting “external”           are connected to group nodes with an edge. For example, we can
data. Yet their dwellings are located in environments with their        see that the highlighted data entity node (on the left) is a raw
own characteristics (transportation, climate, employment rate,          .csv file, and the granularity level is “Tenant”, meaning that each
education, etc.), which affect the attractiveness of the dwellings.     line corresponds to a tenant. Note that in Neo4J, groupings are
Being able to combine this external information with landlords’         also modeled as nodes, but are not represented in this Figure.
data would be a real asset for understanding their patrimony.              An example of process is depicted on the right-hand side of
    A data lake can store both “internal” data from social landlords    Figure 3. The process node is colored in yellow. We can see
as well as “external” data gathered on the Internet. In addition,       that three data entity nodes are the process’ input, and three
all types of analyses can be carried out from the data lake.            data entity nodes are the process’ output, meaning that they are
                                                                        generated by the process.
  HOUDAL (public HOUsing DAta Lake). The data lake imple-
                                                                           HOUDAL is operational and is currently being tested by social
mented for social landlords [21] is based on a Web application,
                                                                        landlords. Nevertheless, we have many areas for improvement
and thus is composed of two major parts: the front-end (or client
                                                                        to work on, to make the application more robust and more user-
part) is the user interface for depositing new data, for creating
                                                                        friendly. In addition, we continue to discuss with social landlords
new metadata and for consulting existing metadata; the back-
                                                                        to identify new needs, which could be the subject of future work
end (or server part) features various services such as an API, the
                                                                        to add a new feature to our data lake.
metadata system, data storage, and a user management service.
   HOUDAL Metadata System. goldMEDAL’s metadata model has                  4.2.2 Textual and Tabular Data Lake. The AUDAL data lake
been implemented into the Neo4J graph database management               is motivated by researchers in management science who want
system1. Since Neo4J does not support hyperedges, we                    to analyze the effect of servicization (i.e., the transition from
create a node for each concept. Thus, entities, groups, groupings,      supplying products to supplying services) and digitization on
links and processes translate as nodes, each bearing a label and        small and medium sized companies’ economic performance [2].
attributes.                                                             Source data are various textual documents (annual reports, press
                                                                        releases, websites, social media posts) and spreadsheet files fea-
   Data entities. The different data files that populate the data       turing quantitative (e.g., stocks) and qualitative (e.g., degree of
lake are data entities. They can be either raw data files sent by       servicization) characteristics.
landlords (often in comma separated value files) or reworked
data, sometimes stored in various formats such as .pkl or .RData,          Metadata Management in AUDAL. AUDAL’s metadata sys-
for Python and R analyses, respectively. Each data entity has its       tem is organized in three levels. The first level manages data
node labeled :ENTITY and the entity’s properties, such as file          entities. Data entities, i.e., textual documents and spreadsheet
name or description, are stored in the node’s attributes.               tables, are categorized as raw and refined. Raw tables or docu-
                                                                        ments are actually pointers to the corresponding files in their
   Groupings for Categorizing Data Entities. With HOUDAL, users         original format. Raw data entities store metadata properties, in
can create as many groupings as necessary, and several groups           the form of Neo4J node attributes, e.g., file author(s), date of cre-
for each grouping. Data entities can be linked to zero, one or sev-     ation, etc. Refined data entities are automatically generated from
eral groups for each grouping. In Neo4J, groupings are modeled          raw data entities. They are transformed so as to be exploited in
by nodes carrying a :GROUPING label. Groups are also nodes,             analyses. More concretely, raw textual documents are refined
carrying both a :GROUP label and the grouping’s name as a sec-          into bag-of-word vectors or document embedding vectors stored
ond label, in order to facilitate querying. A data entity node (resp.   in the MongoDB document-oriented database management sys-
group node) is linked to a group node (resp. grouping node) with        tem2 , and referenced from Neo4J nodes (Figure 4). Similarly, raw
an edge labeled with the grouping’s name (resp. :GROUPING).             spreadsheet tables are refined into relational tables to benefit from
With groups and groupings, users can, for example, determine            SQL querying.
whether it is internal or external data, or the data refinement            The second level in AUDAL’s metadata system handles rela-
level (zones), and so on.                                               tionships between data items. We use two kinds of relationships
   Processes for Tracking Data Lineage. Like other goldMEDAL            in accordance with goldMEDAL concepts: groupings and (sim-
concepts, a process is also modeled by a node in Neo4J, bearing         ilarity) links. Some of the groupings relate to both tabular and
the :PROCESS label. A process can be a script for transforming          textual data, e.g., groupings on the MIME type or data source.
or cleaning a data file, i.e., a data entity. If a data entity is the   Conversely, others are relevant for only one type of data, e.g., the
input of a process, there is an edge labeled :PROCESS_IN from           grouping on the language of documents. We materialize group-
the entity node to the process node. Inversely, an edge labeled         ings in Neo4J through a set of nodes. Each grouping is a simple
:PROCESS_OUT from the process node to the entity node is                node with which all associated groups are linked. Then, groups
created if a new data entity is generated by the process.               are in turn linked to the corresponding data entities.
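
   The node-based translation described above can be sketched with a small in-memory property graph. This is only an illustration under stated assumptions, not HOUDAL's actual code: the class MetadataGraph, its methods, and the sample file and group names are hypothetical, and a plain Python structure stands in for Neo4J. Only the labels ENTITY, GROUP, GROUPING and PROCESS and the edge labels PROCESS_IN and PROCESS_OUT come from the text.

```python
from collections import defaultdict


class MetadataGraph:
    """Tiny in-memory property graph standing in for Neo4J (illustration only)."""

    def __init__(self):
        self.nodes = {}                # node id -> {"labels": set, "props": dict}
        self.out = defaultdict(list)   # source id -> [(edge label, target id)]
        self.inc = defaultdict(list)   # target id -> [(edge label, source id)]

    def add_node(self, node_id, labels, **props):
        self.nodes[node_id] = {"labels": set(labels), "props": props}

    def add_edge(self, src, label, dst):
        self.out[src].append((label, dst))
        self.inc[dst].append((label, src))

    def add_group(self, group_id, grouping_id):
        # No hyperedges: a group is a node labeled GROUP plus its grouping's
        # name, linked to the grouping node by a GROUPING edge.
        if grouping_id not in self.nodes:
            self.add_node(grouping_id, ["GROUPING"])
        self.add_node(group_id, ["GROUP", grouping_id])
        self.add_edge(group_id, "GROUPING", grouping_id)

    def assign(self, entity_id, group_id):
        # The entity-to-group edge is labeled with the grouping's name.
        grouping = next(dst for lbl, dst in self.out[group_id] if lbl == "GROUPING")
        self.add_edge(entity_id, grouping, group_id)

    def add_process(self, process_id, inputs, outputs, **props):
        self.add_node(process_id, ["PROCESS"], **props)
        for entity_id in inputs:
            self.add_edge(entity_id, "PROCESS_IN", process_id)
        for entity_id in outputs:
            self.add_edge(process_id, "PROCESS_OUT", entity_id)

    def members(self, group_id):
        """Data entities attached to a group."""
        return {src for _, src in self.inc[group_id]
                if "ENTITY" in self.nodes[src]["labels"]}

    def upstream(self, entity_id):
        """Entities that entity_id was derived from, through any chain of processes."""
        found, stack = set(), [entity_id]
        while stack:
            current = stack.pop()
            for label, process_id in self.inc[current]:
                if label != "PROCESS_OUT":
                    continue
                for label_in, src in self.inc[process_id]:
                    if label_in == "PROCESS_IN" and src not in found:
                        found.add(src)
                        stack.append(src)
        return found


# Hypothetical sample metadata: one raw file and one reworked file.
g = MetadataGraph()
g.add_node("tenants.csv", ["ENTITY"], description="raw tenant file")
g.add_node("tenants.pkl", ["ENTITY"], description="cleaned tenant file")
for group, grouping in [("raw", "ZONE"), ("processed", "ZONE"),
                        ("csv", "FORMAT"), ("pkl", "FORMAT")]:
    g.add_group(group, grouping)
g.assign("tenants.csv", "raw")
g.assign("tenants.csv", "csv")
g.assign("tenants.pkl", "processed")
g.assign("tenants.pkl", "pkl")
g.add_process("clean_tenants", inputs=["tenants.csv"], outputs=["tenants.pkl"])

raw_csv = g.members("raw") & g.members("csv")   # intersect two groups
sources = g.upstream("tenants.pkl")             # lineage of the reworked file
```

   Intersecting the members of groups taken from different groupings gives the kind of navigation that groupings are meant to support, while following PROCESS_OUT and PROCESS_IN edges backwards recovers the lineage of a data entity.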
                                                                           We define two types of links with respect to the type of
  Example. Figure 3 presents a sample of metadata stored in             data they relate to. Document similarity links express how much
Neo4J. Data entity nodes are colored in red. On both sides of the
1 https://neo4j.com                                                     2 https://www.mongodb.com
                                              Figure 3: HOUDAL sample Neo4J metadata


two documents are similar. These links are materialized by non-            Archaeological data may bear many different types, e.g., tex-
oriented edges between data entity nodes in Neo4J. Similarly, we        tual documents (excavation reports), images (photographs, draw-
express links between tabular data with Table joinability links.        ings, plans...), sensor data, chemical analysis results, etc. Even
Such links (labeled PK_FK_LINK in Figure 4) actually represent          structured data are often produced by various devices that are
some automatically detected functional dependencies between             not compatible with each other. Moreover, the description of an
columns from different tables. In Neo4J, table joinability edges        archaeological object also differs with respect to users, usages
are oriented.                                                           and time. Thus, archaeologists use semantic resources such as
   Finally, our model’s third level consists of metadata                thesauruses to interoperate data from various origins.
used to speed up or enhance analyses. It includes indexes that
enable and speed up keyword-based search on textual documents               Physical Model of Data Entities. The implementation of Ar-
as well as spreadsheet files. These indexes are managed by Elas-        chaeoDAL heavily relies on the Apache ecosystem. In particular,
ticSearch3. Moreover, AUDAL’s metadata system also includes se-         its metadata system rests on the Atlas4 data governance and meta-
mantic resources, i.e., dictionaries and thesauruses. Such resources,   data framework. Atlas’ objects match with goldMEDAL’s data
stored in MongoDB, allow, among other things, automatic query exten-    entities. In addition to metadata properties (in the form of key-
sion.                                                                   value pairs), objects may also relate to terms from thesauruses, i.e.,
                                                                        goldMEDAL links, and classifications, i.e., goldMEDAL groupings
   Analyses with AUDAL. AUDAL allows both data retrieval and            (Figure 5).
content analyses. Data retrieval works in three different ways.             Moreover, we exploit Atlas’ object types to fulfill domain-
The first way exploits indexes to allow term-based queries. It is       specific requirements regarding metadata properties. For exam-
effective for both textual documents and tabular data. AUDAL            ple, in the HyperThesau project, users need not only semantic
also provides navigation as a solution to discover data of interest.    metadata to understand data contents, but also geographical
This is done by intersecting groups from different groupings. For       metadata to know where archaeological objects were discovered.
example, such queries allow finding data from a specific source         The benefits of having an object type system include:
and created in a specified year. Finally, data can be retrieved using         • consistency: a universal definition of metadata can avoid
relatedness, starting from a specified data object and then finding             terminological variations that may cause data retrieval
the most related data, namely similar documents or joinable                     problems;
tables.                                                                       • flexibility: a domain-specific type system helps define spe-
   Content analyses are actually a way to aggregate data. In                    cific metadata for requirements in each use case;
the case of textual documents, such analyses include document                 • efficiency: with a given metadata type system, it is easy
clustering or scoring with respect to a set of keywords and text                to write and implement search queries. Because names
concordance. Tabular data are exploited through SQL queries,                    and types of all metadata properties are known in ad-
the clustering of table rows and correlation analyses between                   vance, we can filter data with metadata predicates such as
columns.                                                                        𝑢𝑝𝑙𝑜𝑎𝑑_𝑑𝑎𝑡𝑒 > ‘10/02/2016’.
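
   As a minimal sketch of why a known type system makes such predicates easy to write, the following example checks metadata against a type definition before filtering. It does not use the Atlas API: the type definition, the function names and the sample objects are hypothetical, and the date above is assumed to be written day/month/year.

```python
from datetime import date, datetime

# Hypothetical type definition: property name -> expected Python type.
ARCHAEO_OBJECT_TYPE = {"name": str, "upload_date": date, "site": str}


def parse_date(text):
    # Assumption: dates are written day/month/year.
    return datetime.strptime(text, "%d/%m/%Y").date()


def conforms(metadata, type_def):
    """True if metadata has exactly the properties of type_def, with the right types."""
    return (metadata.keys() == type_def.keys()
            and all(isinstance(metadata[key], expected)
                    for key, expected in type_def.items()))


def select(objects, type_def, predicate):
    """Keep the objects that conform to type_def and satisfy predicate."""
    return [obj for obj in objects if conforms(obj, type_def) and predicate(obj)]


objects = [
    {"name": "report_A", "upload_date": parse_date("03/01/2016"), "site": "site_1"},
    {"name": "photo_B", "upload_date": parse_date("25/06/2017"), "site": "site_2"},
    # Does not conform: upload_date is a string, not a date.
    {"name": "untyped_C", "upload_date": "25/06/2017", "site": "site_2"},
]

threshold = parse_date("10/02/2016")
recent = select(objects, ARCHAEO_OBJECT_TYPE,
                lambda obj: obj["upload_date"] > threshold)
recent_names = [obj["name"] for obj in recent]
```

   Because conformance is checked first, the predicate only ever compares values of the declared type, so a malformed object is excluded instead of raising an error during the comparison.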
   4.2.3 Archaeological Data Lake. This data lake was designed             Physical Model of Processes. Atlas also includes a nice lineage
during the course of the multidisciplinary project “Hyper the-          feature that helps visualize chains of processes. For instance,
saurus and data lakes: Mine the city and its archaeological archives”   Figure 6 represents a simple ingestion process of raw data stored
(HyperThesau) [2, 12]. Let us name it ArchaeoDAL, in echo to            in HDFS into a Hive table, where objects are symbolized by blue
HOUDAL and AUDAL, though it was actually never called so.               hexagons and the process by a green hexagon.

3 https://www.elastic.co                                                4 https://atlas.apache.org
                                            Figure 4: AUDAL sample Neo4J metadata




                                                   Figure 5: Sample Atlas object


   Thesauruses and Links. The HyperThesau project heavily relies    a category may have only one parent. A category without a
on thesauruses to organize data. A thesaurus consists of a set of   parent is called the root category. Conversely, a category may
categories and terms that help regroup data. In Atlas’ glossary,    have several subcategories or terms. A term must have a parent
                                                   Figure 6: Sample Atlas lineage


category but no subcategory. A term may have relationships           metadata system, which allows non-data or non-computer scien-
(i.e., goldMEDAL links) with other terms, e.g., related words,       tists to transform and analyze their own data autonomously, just
synonyms, antonyms, etc. Note that it would be easy to represent     as dynamic reports are prepared on top of data warehouses for
ontologies or taxonomies, too.                                       the use of business (i.e., non-technical) users. However, such a
    Finally, we add specific links between data nodes associ-        software layer must not become yet another black box. In conse-
ated with term nodes from the thesaurus. The left-hand side of       quence, we must take great care to accompany users in their
Figure 7 displays an excerpt of the thesaurus. Figure 7 also shows   appropriation of our analysis tools, not only by training, but also
how a term (arme défensive, i.e., defensive weapon) points to the    by interweaving research methodologies from computer science
corresponding metadata (short and long descriptions) and related     with business practices by design, in close collaboration with the
terms.                                                               partners.
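
   The thesaurus constraints described above (a single parent per category, a root category without parent, terms that have a parent category but no subcategory, and relationships between terms) can be sketched as a small data structure. This is an illustration only, not the Atlas glossary API: the class names, the relation name and every sample value except the term arme défensive are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Category:
    name: str
    parent: Optional["Category"] = None            # at most one parent; None means root
    children: list = field(default_factory=list)   # subcategories and terms

    def __post_init__(self):
        if self.parent is not None:
            self.parent.children.append(self)


@dataclass(eq=False)
class Term:
    name: str
    parent: Category                               # a term must have a parent category
    related: dict = field(default_factory=dict)    # relation name -> list of terms
    data: list = field(default_factory=list)       # identifiers of linked data entities

    def __post_init__(self):
        if not isinstance(self.parent, Category):
            raise ValueError("a term must have a parent category")
        self.parent.children.append(self)

    def relate(self, relation, other):
        # Relationship between two terms, i.e., a goldMEDAL link.
        self.related.setdefault(relation, []).append(other)


def path(node):
    """Names from the root category down to the given category or term."""
    names = []
    while node is not None:
        names.append(node.name)
        node = node.parent
    return list(reversed(names))


# Hypothetical excerpt of a thesaurus.
root = Category("root")
weapons = Category("weapons", parent=root)
defensive = Term("arme défensive", parent=weapons)
shield = Term("shield", parent=weapons)
defensive.relate("related", shield)
defensive.data.append("report_42")   # link from the term to a data entity

defensive_path = path(defensive)
```

   A term has no children attribute, so it cannot receive subcategories, and the single parent field enforces the one-parent rule by construction.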
                                                                         Moreover, exploiting a data lake and its metadata system may
5   CONCLUSION                                                       contribute to open data and open science. A well-designed data
                                                                     lake should indeed readily enforce the four FAIR principles5 ,
In this paper, we introduced goldMEDAL, a generic data lake
                                                                     i.e., findability, accessibility, interoperability and reusability. By
metadata model. goldMEDAL is based on four main concepts:
                                                                     adding an industrialization layer that allows non-data or non-
data entity, grouping, link and process, which are defined at
                                                                     computer scientists to exploit the data lake, we can further improve
the conceptual and logical levels. These concepts interact alto-
                                                                     accessibility in a non-technical way, i.e., not only through suitable
gether to support data lake metadata management requirements
                                                                     communication protocols. FAIR principles are very appealing to
and they generalize almost all the concepts proposed in state-
                                                                     researchers in humanities and social sciences, as illustrated by
of-the-art metadata models: the concept of grouping supports
                                                                     AUDAL (management sciences; Section 4.2.2) and ArchaeoDAL
the organization of data lakes in zones [18]; groupings allow
                                                                     (archaeology; Section 4.2.3).
managing multiple data granularity levels as in HANDLE [5].
                                                                         Finally, to the best of our knowledge, the maintenance of data
   Moreover, goldMEDAL supports all the features identified to
                                                                     lake metadata is a completely open issue. For instance, how to
compare data lake metadata models (Section 2.2), making it the
                                                                     manage a new categorization of metadata? How to change or
most generic metadata model to the best of our knowledge.
                                                                     transform the metadata system when it hits some limits, whether
   Another particularity of goldMEDAL is the explicit possibility
                                                                     technical or functional? What if metadata become big in the sense
of data lineage tracing with the concept of process. goldMEDAL
                                                                     of voluminous big data? Should obsolete data be deleted, which
thus manages the dynamics of data, while the most recent meta-
                                                                     is contrary to the principle of data lakes, and how to ensure that
data model from the literature, HANDLE [5], does not natively
                                                                     the metadata accessibility FAIR principle remains enforced when
support it.
                                                                     source data are no longer available?
   Finally, we showed as a proof of concept how goldMEDAL
can be translated from conceptual and logical models to actual
physical models with three different implementations of metadata
                                                                     ACKNOWLEDGEMENTS
models from distinct data lakes that feature both structured and     E. Scholly’s PhD is funded by BIAL-X6 . P.N. Sawadogo’s PhD
unstructured data.                                                   is funded by the Auvergne-Rhône-Alpes Region through the
   Future research and open issues include the “industrialization”   5 https://www.go-fair.org/fair-principles/
of data lakes, i.e., providing a software layer, connected to the    6 https://www.bial-x.com/
                                                                Figure 7: Sample Atlas thesaurus


AURA-PMI project. The HyperThesau project is funded by the                              [13] Cedrine Madera and Anne Laurent. 2016. The next information architecture
Laboratory of Excellence “Intelligence of Urban Worlds” (IMU)7 .                             evolution: the data lake wave. In International Conference on Management of
                                                                                             Digital EcoSystems (MEDES 2016), Biarritz, France. 174–180.
                                                                                        [14] Hassan Mehmood, Ekaterina Gilman, Marta Cortes, Panos Kostakos, Andrew
REFERENCES                                                                                   Byrne, Katerina Valta, Stavros Tekes, and Jukka Riekki. 2019. Implementing
                                                                                             Big Data Lake for Heterogeneous Data Sources. In International Conference on
 [1] Amin Beheshti, Boualem Benatallah, Reza Nouri, and Alireza Tabebordbar.                 Data Engineering Workshops (ICDEW 2019), Macau SAR, China (IEEE). 37–44.
     2018. CoreKG: A Knowledge Lake Service. Proceedings of the Very Large Data         [15] Natalia Miloslavskaya and Alexander Tolstoy. 2016. Big Data, Fast Data
     Base Endowment (VLDB 2018) 11, 12 (August 2018), 1942–1945.                             and Data Lake Concepts. In International Conference on Biologically Inspired
 [2] Jérôme Darmont, Cecile Favre, Sabine Loudcher, and Camille Noûs. 2020. Data             Cognitive Architectures (BICA 2016), NY, USA (Procedia Computer Science),
     Lakes for Digital Humanities. In 2nd International Digital Tools & Uses Congress        Vol. 88. 1–6.
     (DTUC 2020), Hammamet, Tunisia. ACM, New York, 38–41.                              [16] Christoph Quix and Rihan Hai. 2018. Data Lake. Encyclopedia of Big Data
 [3] Claudia Diamantini, Paolo Lo Giudice, Lorenzo Musarella, Domenico Potena,               Technologies (2018), 1–8.
     Emanuele Storti, and Domenico Ursino. 2018. A New Metadata Model to                [17] Christoph Quix, Rihan Hai, and Ivan Vatov. 2016. Metadata Extraction and
     Uniformly Handle Heterogeneous Data Lake Sources. In European Conference                Management in Data Lakes With GEMMS. Complex Systems Informatics and
     on Advances in Databases and Information Systems (ADBIS 2018), Budapest,                Modeling Quarterly 9 (December 2016), 289–293.
     Hungary. 165–177.                                                                  [18] Franck Ravat and Yan Zhao. 2019. Metadata management for data lakes. In
 [4] James Dixon. 2010. Pentaho, Hadoop, and Data Lakes.                                   European Conference on Advances in Databases and Information Systems (ADBIS
     https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-                         2019), Bled, Slovenia. Springer, 37–44.
     data-lakes/.                                                                       [19] Pegdwendé Sawadogo and Jérôme Darmont. 2021. On data lake architectures
 [5] Rebecca Eichler, Corinna Giebler, Christoph Gröger, Holger Schwarz, and                 and metadata management. Journal of Intelligent Information Systems 56, 1
     Bernhard Mitschang. 2020. HANDLE-A Generic Metadata Model for Data                      (2021), 97–120.
     Lakes. In International Conference on Big Data Analytics and Knowledge Dis-        [20] Pegdwendé N Sawadogo, Etienne Scholly, Cécile Favre, Eric Ferey, Sabine
     covery (DaWaK 2020), Bratislava, Slovakia. 73–88.                                       Loudcher, and Jérôme Darmont. 2019. Metadata systems for data lakes: mod-
 [6] Ashley Farrugia, Rob Claxton, and Simon Thompson. 2016. Towards Social                  els and features. In International Workshop on BI and Big Data Applications
     Network Analytics for Understanding and Managing Enterprise Data Lakes.                 (BBIGAP@ADBIS 2019), Bled, Slovenia. Springer, 440–451.
     In Advances in Social Networks Analysis and Mining (ASONAM 2016), San              [21] Étienne Scholly. 2019. Business Intelligence & Analytics Applied to Public
     Francisco, CA, USA (IEEE). 1213–1220.                                                   Housing. In ADBIS Doctoral Consortium (DC@ADBIS 2019), Bled, Slovenia.
 [7] Corinna Giebler, Christoph Gröger, Eva Hoos, Holger Schwarz, and Bernhard               Springer, 552–557.
     Mitschang. 2019. Leveraging the Data Lake - Current State and Challenges.          [22] Isuru Suriarachchi and Beth Plale. 2016. Crossing Analytics Systems: A Case
     In International Conference on Big Data Analytics and Knowledge Discovery               for Integrated Provenance in Data Lakes. In International Conference on e-
     (DaWaK 2019), Linz, Austria.                                                            Science (e-Science 2016), Baltimore, MD, USA (IEEE). 349–354.
 [8] Rihan Hai, Sandra Geisler, and Christoph Quix. 2016. Constance: An Intelli-
     gent Data Lake System. In International Conference on Management of Data
     (SIGMOD 2016), San Francisco, CA, USA (ACM Digital Library). 2097–2100.
 [9] Joseph M. Hellerstein, Vikram Sreekanti, Joseph E. Gonzalez, James Dalton,
     Akon Dey, Sreyashi Nag, Krishna Ramachandran, Sudhanshu Arora, Arka
     Bhattacharyya, Shirshanka Das, Mark Donsky, Gabriel Fierro, Chang She,
     Carl Steinbach, Venkat Subramanian, and Eric Sun. 2017. Ground: A Data
     Context Service. In Biennial Conference on Innovative Data Systems Research
     (CIDR 2017), Chaminade, CA, USA.
[10] Bill Inmon. 2016. Data Lake Architecture: Designing the Data Lake and avoiding
     the garbage dump. Technics Publications.
[11] Pwint Phyu Khine and Zhao Shun Wang. 2017. Data Lake: A New Ideology
     in Big Data Era. In International Conference on Wireless Communication and
     Sensor Network (WCSN 2017), Wuhan, China (ITM Web of Conferences), Vol. 17.
     1–6.
[12] Pengfei Liu, Sabine Loudcher, Jérôme Darmont, Emmanuelle Perrin, Jean-
     Pierre Girard, and Marie-Odile Rousset. 2020. Metadata model for an archeo-
     logical data lake. Digital Humanities Conference (DH 2020), Ottawa, Canada.


7 https://imu.universite-lyon.fr