=Paper=
{{Paper
|id=Vol-2840/paper5
|storemode=property
|title=Coining goldMEDAL: A New Contribution to Data Lake Generic Metadata Modeling
|pdfUrl=https://ceur-ws.org/Vol-2840/paper5.pdf
|volume=Vol-2840
|authors=Etienne Scholly,Pegdwende N. Sawadogo,Pengfei Liu,Javier A. Espinosa-Oviedo,Cecile Favre,Sabine Loudcher,Jerome Darmont,Camille Nous
|dblpUrl=https://dblp.org/rec/conf/dolap/SchollySLEFLDN21
}}
==Coining goldMEDAL: A New Contribution to Data Lake Generic Metadata Modeling==
* Étienne Scholly, Université de Lyon, Lyon 2, UR ERIC & BIAL-X, Lyon, France (etienne.scholly@bial-x.com)
* Pegdwendé N. Sawadogo, Université de Lyon, Lyon 2, UR ERIC, Lyon, France (pegdwende.sawadogo@univ-lyon2.fr)
* Pengfei Liu, Université de Lyon, Lyon 2, UR ERIC, Lyon, France (pengfei.liu@eric.univ-lyon2.fr)
* Javier A. Espinosa-Oviedo, Université de Lyon, Lyon 2, UR ERIC-LAFMIA lab, Lyon, France (javier.espinosa@imag.fr)
* Cécile Favre, Université de Lyon, Lyon 2, UR ERIC, Lyon, France (cecile.favre@univ-lyon2.fr)
* Sabine Loudcher, Université de Lyon, Lyon 2, UR ERIC, Lyon, France (sabine.loudcher@univ-lyon2.fr)
* Jérôme Darmont, Université de Lyon, Lyon 2, UR ERIC, Lyon, France (jerome.darmont@univ-lyon2.fr)
* Camille Noûs, Université de Lyon, Lyon 2, Laboratoire Cogitamus, Lyon, France (camille.nous@cogitamus.fr)

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===ABSTRACT===
The rise of big data has revolutionized data exploitation practices and led to the emergence of new concepts. Among them, data lakes have emerged as large heterogeneous data repositories that can be analyzed by various methods. An efficient data lake requires a metadata system that addresses the many problems arising when dealing with big data. Consequently, the study of data lake metadata models is currently an active research topic and many proposals have been made in this regard. However, existing metadata models are either tailored for a specific use case or insufficiently generic to manage different types of data lakes, including our previous model MEDAL. In this paper, we generalize MEDAL's concepts in a new metadata model called goldMEDAL. Moreover, we compare goldMEDAL with the most recent state-of-the-art metadata models aiming at genericity and show that we can reproduce these metadata models with goldMEDAL's concepts. As a proof of concept, we also illustrate that goldMEDAL allows the design of various data lakes by presenting three different use cases.

===1 INTRODUCTION===
While the big data revolution has shaken up the entire field of data management and analytics, new concepts have emerged to meet these new challenges. Data lakes belong to such new concepts. First introduced by James Dixon, a data lake is a vast repository of raw and heterogeneous data from which various analyses can be performed [4]. Data lakes quickly gained popularity and several teams started to address research issues [13, 15]. A key one is efficient metadata management, which prevents data lakes from turning into unexploitable data swamps [10, 11, 16, 19, 22].

However, most metadata management proposals in the literature [1, 8, 14], and their associated implementations, give few details on the way data are conceptually organized and are hence hardly reusable. Thus, other researchers proposed more theoretical approaches named metadata models. Such approaches aim to provide detailed guidelines for metadata system design, while being generic, i.e., flexible and adaptable to many use cases. Yet, data lake generic metadata modeling is still an open research issue. A feature-based assessment indeed shows that none of the existing metadata models is generic enough, including our own MEtadata model for DAta Lakes (MEDAL) [20].

To address this genericity issue, we introduce goldMEDAL, a revision of our MEDAL model. We define goldMEDAL through a classical three-level modeling process (i.e., conceptual, logical and physical). We choose a formal representation to avoid ambiguity but also provide a UML representation for readability. The logical level is a translation of the concepts using graph theory. Finally, we describe three different physical models as proofs of concept. Furthermore, to highlight goldMEDAL's genericity, we show that the concepts of our metadata model help model state-of-the-art metadata models from the literature.

The remainder of this paper is organized as follows. Section 2 reviews and discusses existing data lake metadata models. Section 3 presents goldMEDAL's conceptual and logical models. Section 4 illustrates how goldMEDAL generalizes other data lake metadata models and how it can be used to implement different data lakes. Finally, Section 5 concludes this paper and hints at future research.

===2 RELATED WORKS===
Metadata management plays a vital role in data lakes. Indeed, in the absence of a fixed schema, data querying and analyses depend on an efficient metadata system. Several approaches help manage metadata in data lakes. However, only a few of them provide enough detail to ensure reusability. We refer to them as metadata models. In this section, we review state-of-the-art metadata models (Section 2.1) and compare them with respect to genericity (Section 2.2).

====2.1 Metadata Models for Data Lakes====
GEMMS (Generic and Extensible Metadata Management System) is a pioneering generic metadata model for data lakes [17]. GEMMS features two abstract entities: data file and data unit. A data file represents a generic data source. A data unit represents an identifiable data element inside a data source. Each data file is composed of a set of data units (e.g., a spreadsheet file is composed of a set of sheets). Data files and data units can be enriched with atomic or complex metadata values. However, GEMMS requires information on data structure to operate, which makes it unsuitable for working with unstructured data.

Ground is another generic metadata model [9] that can be used for modeling metadata in data lakes (although it was not specifically designed for that). Ground tracks data context (metadata) at three levels: 1) metadata properties, 2) data usage history and 3) data versioning. Although more extensive than GEMMS, Ground (like GEMMS) does not support data linkage, even though this type of metadata has been identified as relevant in data lakes [6, 20].

Based on GEMMS' data file and data unit concepts, the model of Diamantini et al. adds similarity links between data units to indirectly link data files [3]. However, unlike Ground, their model does not include important metadata such as data versioning and usage tracking.

Similar to Diamantini et al., Ravat and Zhao propose a model where each data file can be associated with atomic and complex metadata [18], including metadata properties, data history and links with other data files. The main contribution of this model is the notion of zone metadata. Many data lake architectures consider the existence of zones (e.g., raw data zone, processed data zone) [7, 18]. Zone metadata specifies the zones where data is located. However, Ravat and Zhao's model cannot simultaneously represent different data granularity levels as previous models do [3, 17].

MEDAL represents data through three main concepts: data objects, representations and versions [20]. Data objects correspond to GEMMS' data files. Representations correspond to the result of transformed objects. Versions represent object updates. Both representations and versions are materialized in the data lake. Thus, MEDAL gives alternative ways to track data linkage and zone metadata through the concepts of versions and representations, respectively. MEDAL also supports linkage metadata through categorizations and similarity links. However, MEDAL does not support multiple data granularity levels either.

Finally, HANDLE (Handling metAdata maNagement in Data LakEs) uses the generic concept of data entity to represent both data files and parts of data files, which helps HANDLE support any granularity level [5]. In HANDLE, each data entity is associated with tags that represent zones, granularity levels or categorizations. HANDLE can also connect data entities together through containment links (e.g., between a table and a tuple). HANDLE provides concepts that subsume most of the concepts of the previous metadata models.

====2.2 Genericity of Metadata Models====
A generic metadata model should adapt to any data lake use case. As each use case requires specific metadata management features, we consider that the more features a metadata model supports, the more generic it is. Therefore, features are a suitable way to compare metadata models.

To the best of our knowledge, there exist two feature-based comparisons of data lake metadata models in the literature. We introduced six relevant features: semantic enrichment, data indexing, data polymorphism, data versioning, link generation and usage tracking [20]; while Eichler et al. identified three other features: metadata properties, zone metadata and the support of multiple granularity levels [5].

Considering that both the above sets of features are relevant, we propose to combine them for comparing the genericity of metadata models. Beyond simply taking the union of the features, we merge data polymorphism with zone metadata, as these features both refer to the same concept. We also split link generation into two new features, namely similarity links and categorization, because some metadata models support only one of them. Finally, we omit data indexing in this comparison, considering that indexing does not actually induce metadata modeling issues. Although indexing is definitely relevant to assess metadata systems [20], this feature seems less suited to metadata models.

All in all, we obtain a list of eight features that can serve to compare data lake metadata models and evaluate their genericity.
# Semantic enrichment
# Data polymorphism/multiple zones
# Data versioning
# Usage tracking
# Categorization
# Similarity links
# Metadata properties
# Multiple granularity levels

Table 1 highlights the features supported by all the models reviewed in Section 2.1. It shows that none of them supports all the features we identify.

===3 GOLDMEDAL METADATA MODEL===
Section 2.2 establishes that none of the existing data lake metadata models meets all eight comparison criteria. In this section, we thoroughly describe goldMEDAL, a substantial evolution of MEDAL that generalizes its concepts while addressing all the features identified in Section 2.2.

A metadata model can be expressed "in the form of an explicit schema, a formal definition, or a textual description" [5]. In this paper, we choose a formal approach for the sake of precision. Yet, for the sake of readability and communication with possibly non-computer scientists, we also provide a semi-formal UML model. Moreover, we use a conventional data modeling approach that leverages a conceptual, a logical and a physical model, to demonstrate the actual implementation process of our metadata model.

Section 3.1 presents goldMEDAL's formal and semi-formal conceptual models. Section 3.2 details the translation of goldMEDAL's concepts into a logical, graph-based model. For the sake of clarity, we use the same examples in both sections, i.e., examples at the conceptual level are translated at the logical level. Finally, example physical models, i.e., metadata models actually implemented in data lakes with goldMEDAL, are presented in Section 4.2.

====3.1 Conceptual Model====
In MEDAL, data items were considered either as raw data, or as versions or representations derived from raw data. The concepts of version and representation were used to express updated and transformed data, respectively.
Table 1: Features supported by data lake metadata models

{| class="wikitable"
! Feature !! Number of supporting models (out of 7)
|-
| Semantic enrichment || 7
|-
| Polymorphism/multiple zones || 5
|-
| Data versioning || 4
|-
| Usage tracking || 5
|-
| Categorization || 6
|-
| Similarity links || 5
|-
| Metadata properties || 6
|-
| Multiple granularity levels || 4
|}

{| class="wikitable"
! Model !! GEMMS !! Ground !! Diamantini et al. !! Ravat & Zhao !! MEDAL !! HANDLE !! goldMEDAL
|-
| Total || 4/8 || 5/8 || 4/8 || 7/8 || 7/8 || 7/8 || 8/8
|}

While modeling metadata for various data lakes, we found that more kinds of data items were possible, e.g., temporal representations. Thus, we decided to generalize all such concepts into a global concept named data entity in goldMEDAL.

Accordingly, we also generalized in goldMEDAL:
* update and transformation operations that served to track the lineage of versions and representations, respectively, as well as parenthood relationships that express fusion operations, into the concept of process;
* similarity links into the global concept of link.

Finally, we retained in goldMEDAL the MEDAL concept of grouping, which notably allows multiple data granularity levels. All the main goldMEDAL concepts (data entity, grouping, link and process) are characterized by attributes or properties that constitute their internal metadata.

=====3.1.1 Data Entity=====
Data entities are the basic units of our metadata model. They are flexible in terms of data granularity. For example, a data entity can represent a spreadsheet file, a textual or semi-structured document, an image, a database table, a tuple or an entire database. The introduction of any new element in the data lake leads to the creation of a new data entity.

'''Definition 3.1.''' The set of data entities is denoted <math>\mathcal{E} = \{e_i\}_{i \in \mathbb{N}^*}</math>.

=====3.1.2 Grouping=====
A grouping is a set of groups; a group brings together data entities based on common properties. For example, the raw and processed data zones common in data lake architectures are the groups of a zone grouping. Another example is a grouping of textual documents according to the language of writing.

'''Definition 3.2.''' The set of groupings is denoted <math>\mathcal{G} = \{G_j\}_{j \in \mathbb{N}^*}</math>, with <math>G_j = \{\Gamma_{jk}\}_{k \in \mathbb{N}^*}</math>, where each <math>\Gamma_{jk} \subseteq \mathcal{E}</math> is a group.

'''Example 3.3.''' To get back to our previous examples, <math>\mathcal{G} = \{G_1, G_2\}</math>. <math>G_1 = \{\Gamma_{11}, \Gamma_{12}\}</math> is the zone grouping, with <math>\Gamma_{11}</math> and <math>\Gamma_{12}</math> being the raw data and processed data zones, respectively. <math>G_2 = \{\Gamma_{21}, \Gamma_{22}\}</math> is the language grouping, with <math>\Gamma_{21}</math> and <math>\Gamma_{22}</math> the groups corresponding to the French and English languages, respectively. Note that the groupings <math>G_j</math> are deliberately not partitions of <math>\mathcal{E}</math>. Thus, a bilingual French-English document can belong to both groups <math>\Gamma_{21}</math> and <math>\Gamma_{22}</math>.

=====3.1.3 Link=====
Links are used to associate either data entities with each other or groups of data entities with each other. They can be oriented or not. They allow the expression of, e.g., simple similarity links between data entities or hierarchies between groups. For example, a temporal hierarchy month → quarter would have the months of January, February and March linked to the first quarter of a given year.

'''Definition 3.4.''' The set of links is denoted <math>\mathcal{L} = \{l_m\}_{m \in \mathbb{N}^*}</math>, with each <math>l_m</math> being either:
* a link between data entities, <math>l_m : \mathcal{E} \to \mathcal{E}</math>; or
* a link between the groups of two distinct groupings, <math>l_m : G_j \to G_{j'}</math> with <math>j \neq j'</math>.

'''Example 3.5.''' Let us elaborate on the sample hierarchy month → quarter. Let <math>G_3 = \{\mathit{Jan}, \mathit{Feb}, \ldots, \mathit{Dec}\}</math> be a grouping of data entities per month and <math>G_4 = \{\mathit{Q1}, \mathit{Q2}, \mathit{Q3}, \mathit{Q4}\}</math> be a grouping of quarters in a year. Now, let us make explicit some data entities and their groups: <math>\mathit{Jan} = \{e_1, e_2\}</math>, <math>\mathit{Feb} = \{e_3\}</math>, <math>\mathit{Mar} = \{e_4\}</math>; <math>\mathit{Q1} = \{e_1, e_2, e_3, e_4\}</math>. Link <math>l_1</math> materializes the hierarchical link between groupings <math>G_3</math> and <math>G_4</math>: <math>\mathit{Jan} \xrightarrow{l_1} \mathit{Q1}</math>, <math>\mathit{Feb} \xrightarrow{l_1} \mathit{Q1}</math>, <math>\mathit{Mar} \xrightarrow{l_1} \mathit{Q1}</math>. Inversely, <math>\mathit{Q1} \xrightarrow{l_1^{-1}} \{\mathit{Jan}, \mathit{Feb}, \mathit{Mar}\}</math>.

A functional notation may also be used: <math>l_1(\mathit{Jan}) = \mathit{Q1}</math>, <math>l_1(\mathit{Feb}) = \mathit{Q1}</math>, <math>l_1(\mathit{Mar}) = \mathit{Q1}</math>, <math>l_1^{-1}(\mathit{Q1}) = \{\mathit{Jan}, \mathit{Feb}, \mathit{Mar}\}</math>. Also note that <math>\mathit{Q1} = \mathit{Jan} \cup \mathit{Feb} \cup \mathit{Mar}</math>.

=====3.1.4 Process=====
A process refers to any transformation applied to a set of data entities that produces a new set of data entities.

'''Definition 3.6.''' The set of processes is denoted <math>\mathcal{P} = \{P_n\}_{n \in \mathbb{N}^*}</math>, with <math>P_n = \{I_n, O_n\}</math>, <math>I_n \subseteq \mathcal{E}</math> the set of input data entities of <math>P_n</math> and <math>O_n</math> the set of output data entities that is integrated into <math>\mathcal{E}</math> (<math>\mathcal{E} \leftarrow \mathcal{E} \cup O_n</math>).

'''Example 3.7.''' Process <math>P_1</math> splits a set of textual documents <math>D \subseteq \mathcal{E}</math> into a set of text fragments <math>F \subseteq \mathcal{E}</math>. Here, <math>I_1 = D</math> and <math>O_1 = F</math>.

=====3.1.5 UML model=====
Figure 1 features goldMEDAL's conceptual model as a UML class diagram. All the concepts of goldMEDAL, including group, are modeled as classes (data entity, grouping, group and process) or association classes (entity link and group link, which are labeled E-Link and G-Link in Figure 1, respectively).

Finally, although they are not depicted in Figure 1, all classes and association classes bear attributes that model metadata properties. These attributes may be of any type, including lists, and of course vary with respect to use cases.

Figure 1: UML class diagram of goldMEDAL

====3.2 Logical Model====
As MEDAL and HANDLE did, though at the physical level, we choose to design goldMEDAL's logical model as a graph, which is particularly well-suited to depict relationships between different concepts.

Thus, in this section, we translate the concepts defined in Section 3.1 into graph nodes, edges and hyperedges, using the same indices, e.g., <math>i, j, k</math>... Moreover, we illustrate the translation with the examples used at the conceptual level. Finally, we also propose a graphic illustration of goldMEDAL's logical model.

=====3.2.1 Translation of Data Entity=====
Data entities are modeled by nodes that carry attributes.

'''Definition 3.8.''' The set of nodes is denoted <math>\mathcal{N} = \{n_i\}_{i \in \mathbb{N}^*}</math>. Each node <math>n_i \in \mathcal{N}</math> carries attributes.

'''Example 3.9.''' A PDF file stored in the data lake can be represented by a node <math>n_1</math>.

=====3.2.2 Translation of Grouping=====
A group is represented by a non-oriented hyperedge, i.e., an edge that can link more than two nodes. A grouping is modeled by a set of hyperedges.

'''Definition 3.10.''' A hyperedge (a group) is denoted <math>\theta_{jk} \subseteq \mathcal{N}</math>, with <math>j, k \in \mathbb{N}^*</math>. Any <math>\theta_{jk}</math> carries attributes.

'''Definition 3.11.''' The set of hyperedges of grouping <math>j</math> is denoted <math>H_j = \{\theta_{jk}\}</math> and carries attributes. The set of hyperedge sets (set of groupings) is denoted <math>\mathcal{H}</math>.

'''Example 3.12.''' Let us translate Example 3.3. <math>\mathcal{H} = \{H_1, H_2\}</math>. <math>H_1 = \{\theta_{11}, \theta_{12}\}</math> is the set of hyperedges representing the zone grouping, with <math>\theta_{11}</math> and <math>\theta_{12}</math> the hyperedges representing the raw data and processed data zones, respectively. <math>H_2 = \{\theta_{21}, \theta_{22}\}</math> is the set of hyperedges representing the language grouping, with <math>\theta_{21}</math> and <math>\theta_{22}</math> the hyperedges representing the groups corresponding to the French and English languages, respectively.

=====3.2.3 Translation of Link=====
Links may model relationships between either data entities (nodes) or groups (hyperedges). They are modeled by edges.

'''Definition 3.13.''' The set of edges is denoted <math>\mathcal{A} = \{a_m\}_{m \in \mathbb{N}^*}</math>, with any <math>a_m</math> being either:
* an edge, oriented or not, connecting two nodes. Then, <math>a_m = (n_i, n_{i'}) \in \mathcal{N}^2</math>;
* an oriented edge connecting two hyperedges. Then, <math>a_m = (\theta_{jk}, \theta_{j'k'}) \in H_j \times H_{j'}</math>.
In both cases, the edge carries attributes.

'''Example 3.14.''' To get back to the sample hierarchy month → quarter from Example 3.5, <math>H_3 = \{\theta_{Jan}, \theta_{Feb}, \ldots, \theta_{Dec}\}</math> is a set of hyperedges representing a grouping of data entities per month. <math>H_4 = \{\theta_{Q1}, \theta_{Q2}, \theta_{Q3}, \theta_{Q4}\}</math> is a set of hyperedges representing the grouping of quarters in a year. Let us make this explicit with instances. <math>\theta_{Jan} = \{n_1, n_2\}</math>, <math>\theta_{Feb} = \{n_3\}</math>, <math>\theta_{Mar} = \{n_4\}</math>; <math>\theta_{Q1} = \{n_1, n_2, n_3, n_4\}</math>. Edge <math>a_1</math> materializes the hierarchical link between <math>H_3</math> and <math>H_4</math>: <math>\theta_{Jan} \xrightarrow{a_1} \theta_{Q1}</math>, <math>\theta_{Feb} \xrightarrow{a_1} \theta_{Q1}</math>, <math>\theta_{Mar} \xrightarrow{a_1} \theta_{Q1}</math>. Inversely, <math>\theta_{Q1} \xrightarrow{a_1^{-1}} \{\theta_{Jan}, \theta_{Feb}, \theta_{Mar}\}</math>.

=====3.2.4 Translation of Process=====
A process is modeled by an oriented hyperedge.

'''Definition 3.15.''' The set of oriented hyperedges modeling processes is denoted <math>\mathcal{Q} = \{\Pi_n\}_{n \in \mathbb{N}^*}</math>, with <math>\Pi_n = \{\Upsilon_n, \Omega_n\}</math>, <math>\Upsilon_n \subseteq \mathcal{N}</math> being the set of input nodes of <math>\Pi_n</math> and <math>\Omega_n</math> the set of output nodes integrated into <math>\mathcal{N}</math> (<math>\mathcal{N} \leftarrow \mathcal{N} \cup \Omega_n</math>). Any <math>\Pi_n</math> carries attributes.

'''Example 3.16.''' <math>\Pi_1 = \{\Upsilon_1, \Omega_1\}</math> is an oriented hyperedge representing the process of splitting a set of textual documents (Example 3.7), represented by the set of nodes <math>N_D \subseteq \mathcal{N}</math>, into a set of text fragments represented by the set of nodes <math>N_F \subseteq \mathcal{N}</math>. Then, <math>\Upsilon_1 = N_D</math> and <math>\Omega_1 = N_F</math>.

=====3.2.5 Sample Graph Representation=====
Figure 2 provides a schematic representation of the examples above. Let us introduce eight data entity nodes <math>\{n_i\}_{i \in [1,8]}</math> colored in orange.

Example 3.12 is depicted on the left-hand side of Figure 2. Groups of <math>H_1</math> are colored in purple, while those of <math>H_2</math> are blue. We can see that <math>n_1</math> and <math>n_3</math> belong to the raw data group <math>\theta_{11}</math>, while <math>n_2</math> and <math>n_4</math> are in the processed data group <math>\theta_{12}</math>. Moreover, <math>n_1</math>, <math>n_2</math> and <math>n_3</math> are in the French language group <math>\theta_{21}</math>, and <math>n_4</math> is in the English language group <math>\theta_{22}</math>.

Example 3.14 is represented at the center of Figure 2. Groups of <math>H_3</math>, namely <math>\theta_{Jan}, \ldots, \theta_{Dec}</math>, are colored in green and groups of <math>H_4</math> (<math>\theta_{Q1}, \ldots, \theta_{Q4}</math>) are colored in grey. Edge <math>a_1</math> connects groups of <math>H_3</math> to those of <math>H_4</math>.

Finally, Example 3.16 is depicted on the right-hand side of Figure 2. <math>n_5</math> is a textual document split into fragments <math>n_6</math>, <math>n_7</math> and <math>n_8</math>. <math>\Pi_1</math>'s input and output, <math>\Upsilon_1</math> and <math>\Omega_1</math>, respectively, are colored in yellow.

Figure 2: Sample goldMEDAL graph logical model

===4 GOLDMEDAL ASSESSMENT===
In this section, we discuss goldMEDAL's genericity. To this end, we show in Section 4.1 that the three most complete metadata models can all be modeled with goldMEDAL. In Section 4.2, we present our ongoing implementation work of goldMEDAL on distinct use cases.

====4.1 Comparison of State-of-the-Art Metadata Models with goldMEDAL====
To evaluate goldMEDAL's genericity, we compare it with the three metadata models that are both the most recent and the most complete among metadata models, i.e., MEDAL, Ravat and Zhao's and HANDLE (Section 2.2).

For each comparison, we use a two-column table. The first column lists goldMEDAL's concepts, and the second column the corresponding concepts of the compared model. When a concept does not have an equivalent, it is marked with "—".

=====4.1.1 MEDAL vs. goldMEDAL=====
goldMEDAL's four main concepts help generalize all of MEDAL's concepts (Table 2). Data entity generalizes the concepts of version and representation. Grouping generalizes the concepts of object and grouping (in the sense of MEDAL). Link generalizes the concept of similarity link. Finally, process generalizes transformation, update and parenthood relationship.

Table 2: goldMEDAL and MEDAL concepts

{| class="wikitable"
! goldMEDAL !! MEDAL
|-
| Data entity || Version, Representation
|-
| Grouping || Object, Grouping
|-
| Link || Similarity link
|-
| Process || Update, Transformation, Parenthood relationship
|}

Note that we do not mention in this comparison the global metadata existing in MEDAL. We indeed consider that elements such as logs or indexes mostly induce implementation rather than metadata modeling issues.

Yet, other forms of global metadata, namely semantic resources such as thesauruses and ontologies, can definitely be modeled with goldMEDAL using the node, grouping and link concepts.

=====4.1.2 Ravat and Zhao's Metadata Model vs. goldMEDAL=====
goldMEDAL can handle nearly all concepts of Ravat and Zhao's metadata model [18] (Table 3). Data entity generalizes the concept of dataset and all its subclasses, such as Datalake_Datasets or Source_Datasets. Grouping generalizes the concept of keyword. Finally, link and process directly correspond to relationship and process, respectively.

Table 3: goldMEDAL and Ravat & Zhao concepts

{| class="wikitable"
! goldMEDAL !! Ravat & Zhao
|-
| Data entity || Dataset, Subclass
|-
| Grouping || Keyword
|-
| Link || Relationship
|-
| Process || Process
|-
| — || User, Access
|}

However, two concepts of Ravat and Zhao's metadata model, namely user and access, have no explicit equivalent in goldMEDAL, though they could be classified as global metadata. Users and accesses can indeed be modeled as data entities and processes, respectively.

=====4.1.3 HANDLE vs. goldMEDAL=====
goldMEDAL can also generalize HANDLE's concepts (Table 4). Data entity generalizes both data and metadata, since a data entity is a representation of data that also contains metadata properties. Grouping generalizes three concepts: Categorization, ZoneIndicator, and GranularityIndicator. Finally, process has no direct match in HANDLE, although its authors show processes can be modeled through Action metadata instances of HANDLE's categorization extension [5].

Table 4: goldMEDAL and HANDLE concepts

{| class="wikitable"
! goldMEDAL !! HANDLE
|-
| Data entity || Data, Metadata
|-
| Grouping || Categorization, ZoneIndicator, GranularityIndicator
|-
| Link || Link
|-
| Process || —
|}

Handling multiple granularity levels as in HANDLE was not supported by MEDAL, so it was a design objective for goldMEDAL. Although there is no explicit granularity indicator in goldMEDAL, any data entity could have a granularity property. However, a more efficient way is to define data entities at the finest possible granularity level. Then, coarser granularity levels are obtained with groupings. For example, if each data entity corresponds to a tuple in a relational database, then a grouping represents a set of tables.

====4.2 goldMEDAL Physical Models====
To show that goldMEDAL can model different business issues and manage various functionalities while remaining as simple as possible, we apply our metadata model to three different use cases. We also exemplify how goldMEDAL's logical model (Section 3.2) can be translated into different physical models.

=====4.2.1 Public Housing Data Lake=====
For social landlords (agents or agencies providing social housing), the use of data is nothing new, whether through business intelligence for patrimony management or with data science methods for non-payment forecasting. However, landlords are facing two main problems. On the one hand, their analyses are conducted separately: in different environments, by different individuals and with different tools. This implies that collaborative work on the same data is impossible. On the other hand, landlords know how to use their data, but have much more difficulty capturing and exploiting "external" data. Yet their dwellings are located in environments with their own characteristics (transportation, climate, employment rate, education, etc.), which affect the attractiveness of the dwellings. Being able to combine this external information with landlords' data would be a real asset for understanding their patrimony.

A data lake can store both "internal" data from social landlords and "external" data gathered on the Internet. In addition, all types of analyses can be carried out from the data lake.

'''HOUDAL (public HOUsing DAta Lake).''' The data lake implemented for social landlords [21] is based on a Web application, and thus is composed of two major parts: the front-end (or client part) is the user interface for depositing new data, for creating new metadata and for consulting existing metadata; the back-end (or server part) features various services such as an API, the metadata system, data storage, and a user management service.

'''HOUDAL Metadata System.''' goldMEDAL's metadata model has been implemented in the Neo4J graph database management system<sup>1</sup>. Since Neo4J does not support hyperedges, we create a node for each concept. Thus, entities, groups, groupings, links and processes translate as nodes, each bearing a label and attributes.

'''Data entities.''' The different data files that populate the data lake are data entities. They can be either raw data files sent by landlords (often comma-separated value files) or reworked data, sometimes stored in various formats such as .pkl or .RData, for Python and R analyses, respectively. Each data entity has its node labeled :ENTITY and the entity's properties, such as file name or description, are stored in the node's attributes.

'''Groupings for Categorizing Data Entities.''' With HOUDAL, users can create as many groupings as necessary, and several groups for each grouping. Data entities can be linked to zero, one or several groups for each grouping. In Neo4J, groupings are modeled by nodes carrying a :GROUPING label. Groups are also nodes, carrying both a :GROUP label and the grouping's name as a second label, in order to facilitate querying. A data entity node (resp. group node) is linked to a group node (resp. grouping node) with an edge labeled with the grouping's name (resp. :GROUPING). With groups and groupings, users can, for example, determine whether data are internal or external, or what their refinement level is (zones), and so on.

'''Processes for Tracking Data Lineage.''' Like other goldMEDAL concepts, a process is also modeled by a node in Neo4J, bearing the :PROCESS label. A process can be a script for transforming or cleaning a data file, i.e., a data entity. If a data entity is the input of a process, there is an edge labeled :PROCESS_IN from the entity node to the process node. Inversely, an edge labeled :PROCESS_OUT from the process node to the entity node is created if a new data entity is generated by the process.

'''Example.''' Figure 3 presents a sample of metadata stored in Neo4J. Data entity nodes are colored in red. On both sides of the figure, a data entity node is highlighted: some of its attributes are depicted at the bottom in grey.

The left-hand side of Figure 3 gives an example of groupings. There are three groupings: a zone grouping, a format grouping and a granularity grouping. Each grouping has its group nodes, colored in green, purple and blue, respectively. Data entity nodes are connected to group nodes with an edge. For example, we can see that the highlighted data entity node (on the left) is a raw .csv file, and the granularity level is "Tenant", meaning that each line corresponds to a tenant. Note that in Neo4J, groupings are also modeled as nodes, but are not represented in this figure.

An example of process is depicted on the right-hand side of Figure 3. The process node is colored in yellow. We can see that three data entity nodes are the process' input, and three data entity nodes are the process' output, meaning that they are generated by the process.

Figure 3: HOUDAL sample Neo4J metadata

HOUDAL is operational and is currently being tested by social landlords. Nevertheless, we have many areas for improvement to work on, to make the application more robust and more user-friendly. In addition, we continue to discuss with social landlords to identify new needs, which could be the subject of future work to add new features to our data lake.

=====4.2.2 Textual and Tabular Data Lake=====
The AUDAL data lake is motivated by researchers in management science who want to analyze the effect of servicization (i.e., the transition from supplying products to supplying services) and digitization on small and medium-sized companies' economic performance [2]. Source data are various textual documents (annual reports, press releases, websites, social media posts) and spreadsheet files featuring quantitative (e.g., stocks) and qualitative (e.g., degree of servicization) characteristics.

'''Metadata Management in AUDAL.''' AUDAL's metadata system is structured in three levels. The first level manages data entities. Data entities, i.e., textual documents and spreadsheet tables, are categorized as raw and refined. Raw tables or documents are actually pointers to the corresponding files in their original format. Raw data entities store metadata properties, in the form of Neo4J node attributes, e.g., file author(s), date of creation, etc. Refined data entities are automatically generated from raw data entities. They are transformed so as to be exploited in analyses. More concretely, raw textual documents are refined into bag-of-words vectors or document embedding vectors stored in the MongoDB document-oriented database management system<sup>2</sup>, and referenced from Neo4J nodes (Figure 4). Similarly, raw spreadsheet tables are refined into relational tables to benefit from SQL querying.

The second level in AUDAL's metadata system handles relationships between data items. We use two kinds of relationships in accordance with goldMEDAL concepts: groupings and (similarity) links. Some of the groupings relate to both tabular and textual data, e.g., groupings on the MIME type or data source. Conversely, others are relevant for only one type of data, e.g., the grouping on the language of documents. We materialize groupings in Neo4J through a set of nodes. Each grouping is a simple node to which all associated groups are linked. Then, groups are in turn linked to the corresponding data entities.

We define two types of links with respect to the type of data they relate to. Document similarity links express how similar two documents are. These links are materialized by non-oriented edges between data entity nodes in Neo4J. Similarly, we express links between tabular data with table joinability links. Such links (labeled PK_FK_LINK in Figure 4) actually represent some automatically detected functional dependencies between columns from different tables. In Neo4J, table joinability edges are oriented.

Finally, our model's third level consists of metadata used to speed up or enhance analyses. It includes indexes that allow and speed up keyword-based search on textual documents as well as spreadsheet files. These indexes are managed by ElasticSearch<sup>3</sup>. Moreover, AUDAL's metadata system also includes semantic resources, i.e., dictionaries and thesauruses. Such resources, stored in MongoDB, allow, among other things, automatic query extension.

'''Analyses with AUDAL.''' AUDAL allows both data retrieval and content analyses. Data retrieval works in three different ways. The first way exploits indexes to allow term-based queries. It is effective for both textual documents and tabular data. AUDAL also provides navigation as a solution to discover data of interest. This is done by intersecting groups from different groupings. For example, such queries allow finding data from a specific source and created in a specified year. Finally, data can be retrieved using relatedness, starting from a specified data object and then finding the most related data, namely similar documents or joinable tables.

Content analyses are actually a way to aggregate data. In the case of textual documents, such analyses include document clustering or scoring with respect to a set of keywords and text concordance. Tabular data are exploited through SQL queries, the clustering of table rows and correlation analyses between columns.

<sup>1</sup> https://neo4j.com

<sup>2</sup> https://www.mongodb.com

=====4.2.3 Archaeological Data Lake=====
This data lake was designed during the course of the multidisciplinary project "Hyper thesaurus and data lakes: Mine the city and its archaeological archives" (HyperThesau) [2, 12]. Let us name it ArchaeoDAL, in echo to HOUDAL and AUDAL, though it was actually never called so.

Archaeological data may be of many different types, e.g., textual documents (excavation reports), images (photographs, drawings, plans...), sensor data, chemical analysis results, etc. Even structured data are often produced by various devices that are not compatible with each other. Moreover, the description of an archaeological object also differs with respect to users, usages and time. Thus, archaeologists use semantic resources such as thesauruses to make data from various origins interoperable.

'''Physical Model of Data Entities.''' The implementation of ArchaeoDAL heavily relies on the Apache ecosystem. In particular, its metadata system rests on the Atlas<sup>4</sup> data governance and metadata framework. Atlas' objects match goldMEDAL's data entities. In addition to metadata properties (in the form of key-value pairs), objects may also relate to terms from thesauruses, i.e., goldMEDAL links, and classifications, i.e., goldMEDAL groupings (Figure 5).

Moreover, we exploit Atlas' object types to fulfill domain-specific requirements regarding metadata properties. For example, in the HyperThesau project, users need not only semantic metadata to understand data contents, but also geographical metadata to know where archaeological objects were discovered. The benefits of having an object type system include:
* consistency: a universal definition of metadata can avoid terminological variations that may cause data retrieval problems;
* flexibility: a domain-specific type system helps define specific metadata for requirements in each use case;
* efficiency: with a given metadata type system, it is easy to write and implement search queries. Because names and types of all metadata properties are known in advance, we can filter data with metadata predicates such as <math>\mathit{upload\_date} > \text{'10/02/2016'}</math>.

'''Physical Model of Processes.''' Atlas also includes a lineage feature that helps visualize chains of processes. For instance, Figure 6 represents a simple ingestion process of raw data stored in HDFS into a Hive table, where objects are symbolized by blue hexagons and the process by a green hexagon.
3 https://www.elastic.co 4 https://atlas.apache.org Figure 4: AUDAL sample Neo4J metadata Figure 5: Sample Atlas object Thesauruses and Links. The HyperThesau project heavily relies a category may have only one parent. A category without a on thesauruses to organize data. A thesaurus consists of a set of parent is called the root category. Conversely, a category may categories and terms that help regroup data. In Atlas’ glossary, have several subcategories or terms. A term must have a parent Figure 6: Sample Atlas lineage category but no subcategory. A term may have relationships metadata system, which allows non-data or non-computer scien- (i.e., goldMEDAL links) with other terms, e.g., related words, tists to transform and analyze their own data in autonomy, just synonyms, antonyms, etc. Note that it would be easy to represent as dynamic reports are prepared on top of data warehouses for ontologies or taxonomies, too. the use of business (i.e, non technical) users. However, such a Eventually, we add specific links between data nodes associ- software layer must not become yet another black box. In conse- ated with term nodes from the thesaurus. The left-hand side of quence, we must take great care of accompanying users in their Figure 7 displays an excerpt of the thesaurus. Figure 7 also shows appropriation of our analysis tools, not only by training, but also how a term (arme défensive, i.e., defensive weapon) points to the by interweaving research methodologies from computer science corresponding metadata (short and long descriptions) and related with business practices by design, in close collaboration with the terms. partners. Moreover, exploiting a data lake and its metadata system may 5 CONCLUSION contribute to open data and open science. A well-designed data lake should indeed readily enforce the four FAIR principles5 , In this paper, we introduced goldMEDAL, a generic data lake i.e., findability, accessibility, interoperability and reusability. 
By metadata model. goldMEDAL is based on four main concepts: adding an industrialization layer that allows non-data or non- data entity, grouping, link and process, which are defined at computer scientist exploit the data lake, we can further improve the conceptual and logical levels. These concepts interact alto- accessibility in a non-technical way, i.e., not only through suitable gether to support data lake metadata management requirements communication protocols. FAIR principles are very appealing to and they generalize almost all the concepts proposed in state- researchers in humanities and social sciences, as illustrated by of-the-art metadata models : the concept of grouping supports AUDAL (management sciences; Section 4.2.2) and ArchaeoDAL the organization of data lakes in zones [18]; groupings allow (archaeology; Section 4.2.3). managing multiple data granularity levels as in HANDLE [5]. Finally, to the best of our knowledge, the maintenance of data Moreover, goldMEDAL supports all the features identified to lake metadata is a completely open issue. For instance, how to compare data lake metadata models (Section 2.2), making it the manage a new categorization of metadata? How to change or most generic metadata model to the best of our knowledge. transform the metadata system when it hits some limits, whether Another particularity of goldMEDAL is the explicit possibility technical or functional? What if metadata become big in the sense of data lineage tracing with the concept of process. goldMEDAL of voluminous big data? Should obsolete data be deleted, which thus manages the dynamics of data, while the most recent meta- is contrary to the principle of data lakes, and how to ensure that data model from the literature, HANDLE [5], does not natively the metadata accessibility FAIR principle remains enforced when support it. source data are no longer available? 
Eventually, we show as a proof of concept how goldMEDAL can be translated from conceptual and logical models to actual physical models with three different implementations of metadata ACKNOWLEDGEMENTS models from distinct data lakes that feature both structured and E. Scholly’s PhD is funded by BIAL-X6 . P.N. Sawadogo’s PhD unstructured data. is funded by the Auvergne-Rhône-Alpes Region through the Future research and open issues include the “industrialization” 5 https://www.go-fair.org/fair-principles/ of data lakes, i.e., providing a software layer, connected to the 6 https://www.bial-x.com/ Figure 7: Sample Atlas thesaurus AURA-PMI project. The HyperThesau project is funded by the [13] Cedrine Madera and Anne Laurent. 2016. The next information architecture Laboratory of Excellence “Intelligence of Urban Worlds” (IMU)7 . evolution: the data lake wave. In International Conference on Management of Digital EcoSystems (MEDES 2016), Biarritz, France. 174–180. [14] Hassan Mehmood, Ekaterina Gilman, Marta Cortes, Panos Kostakos, Andrew REFERENCES Byrne, Katerina Valta, Stavros Tekes, and Jukka Riekki. 2019. Implementing Big Data Lake for Heterogeneous Data Sources. In International Conference on [1] Amin Beheshti, Boualem Benatallah, Reza Nouri, and Alireza Tabebordbar. Data Engineering Workshops (ICDEW 2019), Macau SAR, China (IEEE). 37–44. 2018. CoreKG: A Knowledge Lake Service. Proceedings of the Very Large Data [15] Natalia Miloslavskaya and Alexander Tolstoy. 2016. Big Data, Fast Data Base Endowment (VLDB 2018) 11, 12 (August 2018), 1942–1945. and Data Lake Concepts. In International Conference on Biologically Inspired [2] Jérôme Darmont, Cecile Favre, Sabine Loudcher, and Camille Noûs. 2020. Data Cognitive Architectures (BICA 2016), NY, USA (Procedia Computer Science), Lakes for Digital Humanities. In 2nd International Digital Tools & Uses Congress Vol. 88. 1–6. (DTUC 2020), Hammamet, Tunisia. ACM, New York, 38–41. [16] Christoph Quix and Rihan Hai. 2018. 
Data Lake. Encyclopedia of Big Data [3] Claudia Diamantini, Paolo Lo Giudice, Lorenzo Musarella, Domenico Potena, Technologies (2018), 1–8. Emanuele Storti, and Domenico Ursino. 2018. A New Metadata Model to [17] Christoph Quix, Rihan Hai, and Ivan Vatov. 2016. Metadata Extraction and Uniformly Handle Heterogeneous Data Lake Sources. In European Conference Management in Data Lakes With GEMMS. Complex Systems Informatics and on Advances in Databases and Information Systems (ADBIS 2018), Budapest, Modeling Quarterly 9 (December 2016), 289–293. Hungary. 165–177. [18] Franck Ravat and Yan Zhao. 2019. Metadata management for data lakes. In [4] James Dixon. 2010. Pentaho, Hadoop, and Data Lakes. European Conference on Advances in Databases and Information Systems (ADBIS https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and- 2019), Bled, Slovenia. Springer, 37–44. data-lakes/. [19] Pegdwendé Sawadogo and Jérôme Darmont. 2021. On data lake architectures [5] Rebecca Eichler, Corinna Giebler, Christoph Gröger, Holger Schwarz, and and metadata management. Journal of Intelligent Information Systems 56, 1 Bernhard Mitschang. 2020. HANDLE-A Generic Metadata Model for Data (2021), 97–120. Lakes. In International Conference on Big Data Analytics and Knowledge Dis- [20] Pegdwendé N Sawadogo, Etienne Scholly, Cécile Favre, Eric Ferey, Sabine covery (DaWak 2020), Bratislava, Slovakia. 73–88. Loudcher, and Jérôme Darmont. 2019. Metadata systems for data lakes: mod- [6] Ashley Farrugia, Rob Claxton, and Simon Thompson. 2016. Towards Social els and features. In International Workshop on BI and Big Data Applications Network Analytics for Understanding and Managing Enterprise Data Lakes. (BBIGAP@ADBIS 2019), Bled, Slovenia. Springer, 440–451. In Advances in Social Networks Analysis and Mining (ASONAM 2016), San [21] Étienne Scholly. 2019. Business Intelligence & Analytics Applied to Public Francisco, CA, USA (IEEE). 1213–1220. Housing. 
In ADBIS Doctoral Consortium (DC@ADBIS 2019), Bled, Slovenia. [7] Corinna Giebler, Christoph Gröger, Eva Hoos, Holger Schwarz, and Bernhard Springer, 552–557. Mitschang. 2019. Leveraging the Data Lake - Current State and Challenges. [22] Isuru Suriarachchi and Beth Plale. 2016. Crossing Analytics Systems: A Case In International Conference on Big Data Analytics and Knowledge Discovery for Integrated Provenance in Data Lakes. In International Conference on e- (DaWaK 2019), Linz, Austria. Science (e-Science 2016), Baltimore, MD, USA (IEEE). 349–354. [8] Rihan Hai, Sandra Geisler, and Christoph Quix. 2016. Constance: An Intelli- gent Data Lake System. In International Conference on Management of Data (SIGMOD 2016), San Francisco, CA, USA (ACM Digital Library). 2097–2100. [9] Joseph M. Hellerstein, Vikram Sreekanti, Joseph E. Gonzalez, James Dalton, Akon Dey, Sreyashi Nag, Krishna Ramachandran, Sudhanshu Arora, Arka Bhattacharyya, Shirshanka Das, Mark Donsky, Gabriel Fierro, Chang She, Carl Steinbach, Venkat Subramanian, and Eric Sun. 2017. Ground: A Data Context Service. In Biennial Conference on Innovative Data Systems Research (CIDR 2017), Chaminade, CA, USA. [10] Bill Inmon. 2016. Data Lake Architecture: Designing the Data Lake and avoiding the garbage dump. Technics Publications. [11] Pwint Phyu Khine and Zhao Shun Wang. 2017. Data Lake: A New Ideology in Big Data Era. In International Conference on Wireless Communication and Sensor Network (WCSN 2017), Wuhan, China (ITM Web of Conferences), Vol. 17. 1–6. [12] Pengfei Liu, Sabine Loudcher, Jérôme Darmont, Emmanuelle Perrin, Jean- Pierre Girard, and Marie-Odile Rousset. 2020. Metadata model for an archeo- logical data lake. Digital Humanities Conference (DH 2020), Ottawa, Canada. 7 https://imu.universite-lyon.fr
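
As an illustration of the HOUDAL physical model described above (data entities, groupings, groups and processes as labeled nodes and edges), the following minimal Python sketch mimics such a labeled property graph with plain in-memory structures. It does not use Neo4J or its query language. The node labels (:ENTITY, :GROUPING, :GROUP, :PROCESS) and edge labels (:GROUPING, :PROCESS_IN, :PROCESS_OUT) come from the text; the class, method and function names, the sample file names, the process names and the grouping names "zone" and "origin" are hypothetical and serve only as an example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node with Neo4J-like labels and key-value properties."""
    labels: set
    props: dict = field(default_factory=dict)


class MetadataGraph:
    """Tiny in-memory stand-in for a labeled property graph (assumption: not Neo4J)."""

    def __init__(self):
        self.nodes = {}   # node id -> Node
        self.edges = []   # (source id, edge label, target id)

    def add_node(self, node_id, labels, **props):
        self.nodes[node_id] = Node(set(labels), props)

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def add_entity(self, name, **props):
        # Each data entity is a node labeled ENTITY; properties are node attributes.
        self.add_node(name, {"ENTITY"}, name=name, **props)

    def add_group(self, grouping, group):
        # A grouping is a GROUPING node; a group carries GROUP plus the
        # grouping's name as a second label, and points to its grouping.
        if grouping not in self.nodes:
            self.add_node(grouping, {"GROUPING"}, name=grouping)
        group_id = f"{grouping}/{group}"
        self.add_node(group_id, {"GROUP", grouping}, name=group)
        self.add_edge(group_id, "GROUPING", grouping)
        return group_id

    def assign(self, entity, grouping, group):
        # The entity-to-group edge is labeled with the grouping's name.
        self.add_edge(entity, grouping, f"{grouping}/{group}")

    def add_process(self, name, inputs, outputs):
        self.add_node(name, {"PROCESS"}, name=name)
        for entity in inputs:
            self.add_edge(entity, "PROCESS_IN", name)
        for entity in outputs:
            self.add_edge(name, "PROCESS_OUT", entity)

    def entities_in(self, grouping, group):
        """Entities linked to one group of one grouping."""
        target = f"{grouping}/{group}"
        return {s for s, label, d in self.edges if label == grouping and d == target}

    def upstream(self, entity):
        """All entities from which `entity` was derived, following lineage backwards."""
        found, todo = set(), [entity]
        while todo:
            current = todo.pop()
            producers = [s for s, label, d in self.edges
                         if label == "PROCESS_OUT" and d == current]
            for process in producers:
                for s, label, d in self.edges:
                    if label == "PROCESS_IN" and d == process and s not in found:
                        found.add(s)
                        todo.append(s)
        return found


def build_demo_graph():
    """Build a small hypothetical sample: three files, two groupings, two processes."""
    g = MetadataGraph()
    for name in ("rents_raw.csv", "rents_clean.pkl", "rents_stats.RData"):
        g.add_entity(name)
    for grouping, group in (("zone", "raw"), ("zone", "refined"),
                            ("origin", "external"), ("origin", "internal")):
        g.add_group(grouping, group)
    g.assign("rents_raw.csv", "zone", "raw")
    g.assign("rents_raw.csv", "origin", "external")
    g.assign("rents_clean.pkl", "zone", "refined")
    g.assign("rents_clean.pkl", "origin", "external")
    g.assign("rents_stats.RData", "zone", "refined")
    g.assign("rents_stats.RData", "origin", "internal")
    g.add_process("clean_rents", ["rents_raw.csv"], ["rents_clean.pkl"])
    g.add_process("compute_stats", ["rents_clean.pkl"], ["rents_stats.RData"])
    return g
```

With this sample, intersecting `entities_in("zone", "refined")` with `entities_in("origin", "external")` corresponds to the navigation by intersecting groups from different groupings mentioned for AUDAL, and `upstream("rents_stats.RData")` walks the :PROCESS_OUT and :PROCESS_IN edges backwards to recover the lineage of a derived file. In an actual deployment these lookups would be expressed as graph queries against the metadata store; the linear scans over the edge list here are only meant to keep the sketch short.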