=Paper=
{{Paper
|id=Vol-3727/paper2
|storemode=property
|title=Exploring flexible models in agile MDE
|pdfUrl=https://ceur-ws.org/Vol-3727/paper2.pdf
|volume=Vol-3727
|authors=Artur Boronat
|dblpUrl=https://dblp.org/rec/conf/staf/Boronat24
}}
==Exploring flexible models in agile MDE==
Artur Boronat
School of Computing and Mathematical Sciences, University of Leicester, UK
Abstract
Semi-structured data has become increasingly popular in data science and agile software development
due to its ability to handle a wide variety of data formats, which is particularly important in data lakes
where raw data is often semi-structured or ambiguous. Model-driven engineering (MDE) tools can
provide a high-level, abstract representation of a system or process, making it easier to understand and
navigate data. However, relying on data models to describe metadata of raw data can create challenges
when working with semi-structured data, which can contain errors, inconsistencies, and missing data.
In this work, we present a pragmatic approach to data-centric application development using MDE
that complements current MDE practices. Our approach uses flexible models to enable agility and
adaptability in working with data, compared to traditional metamodel-based methods. We propose a new
metamodel for characterizing such models, with the aim of enabling the development of data-centric
applications that do not require an explicit schema or metamodel. Our work demonstrates the feasibility
of working with flexible models in a wide range of model-to-model transformation languages, particularly
when handling semi-structured data sources.
Keywords
EMF, model transformation, flexible MDE
1. Introduction
The increasing popularity of semi-structured data in agile software development and data
science projects is due to its ability to handle a wide variety of data formats and heterogeneous
data. This is particularly important in data lakes, where raw data is often semi-structured or
ambiguous, requiring data wrangling processes to infer metadata to aid in data analysis [1].
Model-driven engineering (MDE) is a software development approach that uses software
models as a central artifact to capture the structure and behavior of systems. In the case
of data, a software model can represent its structure, including the attributes, relationships,
and constraints between them. MDE tools provide a high-level, abstract representation of a
system or process, which can make it much easier to understand and navigate the data. This is
especially important when working with semi-structured data, which can be highly variable
and may not conform to a strict schema or data model. MDE tools also provide a more formal,
rigorous approach to software development, with a focus on model validation and consistency
checking. This can be helpful when working with semi-structured data, which can contain
errors, inconsistencies and missing data.
Agile MDE 2024, co-located with STAF 2024, 08–11 July 2024, Enschede, Netherlands
artur.boronat@leicester.ac.uk (A. Boronat)
ORCID: 0000-0003-2024-1736 (A. Boronat)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

However, such reliance on data models to describe the metadata of raw data also hinders the application of MDE tools in this context, since the lack of such models is, precisely, what characterizes semi-structured data. This rigidity can create challenges in terms of modeling and integrating data from multiple sources [2]. Moreover, the majority of successful MDE practitioners have either built their own modeling tools, made heavy adaptations of off-the-shelf tools, or spent considerable time working around them. These adaptations introduce accidental complexity, which can add to the difficulties of using MDE tools [3].
In this work, we propose a pragmatic approach to data-centric application development
using the Eclipse Modeling Framework. Our approach uses flexible models to enable agility
and adaptability in working with data, compared to traditional metamodel-based methods. We
propose a new metamodel for characterizing flexible models, with the aim of enabling the devel-
opment of data-centric applications that do not require an explicit schema or metamodel. This
approach allows for the loading of flexible models from a wide range of common data formats
and can handle heterogeneous data. Our approach complements existing MDE approaches by
focussing on data extraction and rapid prototyping, enabling their application in settings where
a metamodel is not available or where data is simply too messy.
In this article, we start by explaining characteristics of semi-structured data and providing an
outline of the model transformation languages used in section 2. We introduce the conceptual
model of flexible metamodels in section 3. Next, we present a case study for analysing flexible
models in section 4. Finally, we discuss related work in section 5, and conclude with a summary
of our findings and directions for future research in section 6.
2. Background
This section provides an overview of two key areas: semi-structured data, which integrates
elements of both structured and unstructured data, and model transformation languages (MTLs),
which facilitate the mapping of input data to output data using defined rules and helper opera-
tions.
2.1. Semi-structured data
Semi-structured data combines elements of structure, like tags, with unstructured components
such as free text, without adhering to a fixed schema like a metamodel. This flexibility is
beneficial in fields like big data and data integration, enabling the handling of diverse data
sources without schema constraints. Despite its versatility, semi-structured data lacks the
rigour of structured data models, which facilitate processing, manipulation, and analysis using
Model-Driven Engineering (MDE) techniques.
It is common to encounter datasets that are stored in semi-structured formats, such as CSV
files. The dataset in Table 1 represents a collection of physical activity data recorded for a group
of patients. The data includes information about the date, time, distance, intensity, air quality,
air temperature, and heart rate of each physical activity performed by the patients. This is an
example where the dataset consists of plain tuples, where there are no further relationships
between fields of the tuple other than belonging to the tuple. The dataset contains examples
of heterogeneous data, with intensity and air quality measured using different scales and
variations in date format and units of measurement for distance and temperature. Additionally,
inconsistencies exist in the formatting of the data, with air temperature sometimes not written consistently and the blank space between figures and units occasionally missing. Furthermore, the dataset includes gaps that indicate a lack of data for certain fields.

Patient ID | Date | Time | Distance | Intensity | Air Quality | Air Temp | Heart (bpm)
1 | 01/03/2023 | 10:00 | 2.5 miles | Moderate | Good | 20 C | 72
1 | 01/03/2023 | 11:00 | 3.0 miles | High | Excellent | 22C | 78
— | 1 March 2023 | 10:00 | 1.5 km | 2 | 1 | — | —
2 | 1 March 2023 | 11:00 | 2.0 km | — | 3 | 68 F | —

Table 1: Dataset with physical activity.
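The heterogeneity in Table 1 can be made concrete with a small data-wrangling sketch that normalizes units and tolerates missing values. This is an illustrative Python fragment, not part of the paper's tooling; the helper names (normalize_distance, normalize_temp) are invented for the example:

```python
import csv, io, re

# Inlined version of the Table 1 data, with the original inconsistencies:
# mixed date formats, miles vs km, "20 C" vs "22C", and blank cells.
RAW = """patient_id,date,time,distance,intensity,air_quality,air_temp,heart_bpm
1,01/03/2023,10:00,2.5 miles,Moderate,Good,20 C,72
1,01/03/2023,11:00,3.0 miles,High,Excellent,22C,78
,1 March 2023,10:00,1.5 km,2,1,,
2,1 March 2023,11:00,2.0 km,,3,68 F,
"""

def normalize_distance(raw):
    """Convert '2.5 miles' or '1.5 km' to kilometres; None for missing values."""
    if not raw:
        return None
    value, unit = raw.split()
    km = float(value) * 1.609344 if unit == "miles" else float(value)
    return round(km, 3)

def normalize_temp(raw):
    """Parse '20 C', '22C', or '68 F' (space optional) into degrees Celsius."""
    if not raw:
        return None
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s?([CF])", raw)
    value, unit = float(m.group(1)), m.group(2)
    return value if unit == "C" else round((value - 32) * 5 / 9, 1)

rows = [
    {**r,
     "distance_km": normalize_distance(r["distance"]),
     "air_temp_c": normalize_temp(r["air_temp"])}
    for r in csv.DictReader(io.StringIO(RAW))
]
```

Even this small example shows why an explicit schema is hard to impose up front: each column needs its own tolerant parsing logic, and absent values must be representable throughout.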
2.2. Model transformation languages at a glance
This work employs two flavours of YAMTL (Yet Another Model Transformation Language) [4, 5],
which incorporate the most common features available in MTLs like ATL, ETL or RubyTL.
YAMTL can be used as an internal domain-specific language for model transformation within
JVM languages. Xtend¹ is used to represent statically-typed MTLs and Groovy² is used to represent dynamically-typed MTLs.
The MTLs define model transformations via transformation rules that map input objects
to output objects. Input and output patterns can comprise multiple object patterns. An input
object pattern specifies the object type and a filter (or guard) condition that must be met for
applying a rule. An output object pattern creates an output object and is specified by the output
object type and a sequence of attribute bindings that initialize the object’s features (attributes
and references). Furthermore, transformation rules may use additional helper attributes and
operations that encapsulate logic to promote reuse. An attribute is a global variable whose
value is reused across transformation rule applications, while an operation is called over a list
of arguments to perform a computation.
In the statically typed MTL (YAMTL Xtend), operation and attribute helpers are called using the operation fetch, whereas in the dynamically typed MTL (YAMTL Groovy) the syntax is more flexible and the names of operation and attribute helpers can be used directly, without an explicit fetch operator.
In both MTLs, the operation o = fetch(i) is also used to obtain an output object o that has already been transformed from an input object i using a model transformation rule. Such an operation is common in MTLs and corresponds to resolveTemp() in ATL and to equivalent() in ETL.
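The rule-matching and trace-resolution mechanics described above can be sketched as follows. This is a minimal, illustrative Python model of a rule engine, not YAMTL syntax; the Engine class, the rule dictionary shape, and the Activity/Session types are all invented for the example:

```python
# Minimal sketch of how an MT engine applies rules and resolves traces
# (illustrative Python, not YAMTL; names are invented for the example).

class Engine:
    def __init__(self, rules):
        self.rules = rules
        self.trace = {}  # input object id -> transformed output object

    def transform(self, inputs):
        outputs = []
        for obj in inputs:
            for rule in self.rules:
                # Input pattern: object type check plus filter (guard) condition.
                if obj["type"] == rule["in_type"] and rule["filter"](obj):
                    # Output pattern: create the output object via its bindings.
                    out = rule["create"](obj)
                    self.trace[id(obj)] = out
                    outputs.append(out)
                    break
        return outputs

    def fetch(self, in_obj):
        # Plays the role of resolveTemp() in ATL / equivalent() in ETL:
        # return the output already produced for a given input object.
        return self.trace[id(in_obj)]

rule = {
    "in_type": "Activity",
    "filter": lambda o: o.get("intensity") is not None,  # guard condition
    "create": lambda o: {"type": "Session", "level": o["intensity"]},
}
engine = Engine([rule])
acts = [{"type": "Activity", "intensity": "High"},
        {"type": "Activity", "intensity": None}]
out = engine.transform(acts)
```

Note how the second input object is silently skipped because the guard rejects it, which is exactly the behaviour one wants when rows of a semi-structured dataset are incomplete.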
3. Flexible models
MTLs use metamodels to ensure that models are well-formed, and that models are semantically correct with respect to any formal OCL-like constraints of the metamodel. For example, EMF models require a metamodel, represented as an Ecore model. Even when model management tasks are programmed generically using the EMF reflective API, the metamodel is still required to be able to access the EMF models.

¹ https://eclipse.dev/Xtext/xtend/
² https://groovy-lang.org/

Figure 1: The metamodel of flexible models.
In this work, we want to be able to work with EMF models independently of their metamodels,
either from a statically typed context or from a dynamically typed context. This is achieved by
means of the metamodel of flexible (untyped) models, shown in Figure 1, which defines a set of
classes to represent the structure of models that do not have a fixed metamodel. We refer to
this metamodel as the flexible metamodel in the remainder of the article. Semi-structured EMF
models will be used to represent the type of semi-structured data shown in the previous section.
The UntypedModel class represents the overall model. The ERecord class represents an instance of a flexible (untyped) EObject, and the RecordField abstract class represents a field of such an EObject: either an AttributeField containing a value of type EJavaObject³, a ReferenceField that refers to another flexible EObject, or a ContainmentField that contains other ERecords, denoting hierarchical structures of objects in the model. Each ERecord points to its parent ERecordContainer, which is the UntypedModel instance for root records and another ERecord for non-root records. Attribute values can refer to collections of values. This metamodel captures the essential structure of models by segregating cross-references, which define graph structures in a model, from containment references, which define hierarchical structures of objects in the model.

³ EJavaObject is the type used within an Ecore model to represent java.lang.Object in EMF.
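The shape of this metamodel can be sketched with plain dataclasses. This is an illustrative Python rendering only; the actual metamodel is defined in Ecore, and these class bodies are simplifications of Figure 1:

```python
# Sketch of the flexible metamodel's structure (illustrative Python;
# the real metamodel is an Ecore model, not Python classes).
from dataclasses import dataclass, field
from typing import Any, List, Union

@dataclass
class AttributeField:
    name: str
    value: Any               # EJavaObject: any value, possibly a collection

@dataclass
class ReferenceField:
    name: str
    target: "ERecord"        # cross-reference: defines graph structure

@dataclass
class ContainmentField:
    name: str
    children: List["ERecord"]  # containment: defines hierarchical structure

@dataclass
class ERecord:
    fields: List[Union[AttributeField, ReferenceField, ContainmentField]] = \
        field(default_factory=list)
    parent: Any = None       # ERecordContainer: UntypedModel or another ERecord

@dataclass
class UntypedModel:
    roots: List[ERecord] = field(default_factory=list)

# Build a tiny model: one root record containing one child record.
model = UntypedModel()
root = ERecord(parent=model)
child = ERecord(parent=root)
root.fields.append(ContainmentField("activities", [child]))
model.roots.append(root)
```

The parent links mirror the ERecordContainer association: the root's parent is the UntypedModel instance, while the child's parent is the containing ERecord.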
The UntypedModel and ERecord classes offer valuable functionalities that have been incor-
porated into the prototype, and their performance has been assessed in the case studies. The
upcoming subsections provide a detailed explanation of these classes.
3.1. ERecord class
To declare an MT on flexible models in tools that require models to conform to metamodels, MTs need to consider the flexible metamodel in Figure 1. By using this metamodel and its model import API, tools can load semi-structured data from popular data formats, such as CSV, JSON, or XML, as flexible models.
The ERecord class offers a dynamically typed interface that provides accessor and mutator methods encapsulating its implementation. Accessor methods define how to retrieve field values: the accessor get(fieldName) returns the value of a field fieldName as a Java Object.
Since the flexible model operates in the absence of metamodel information, there is a mutator operation for each type of field, with upsert semantics. That is, whenever a mutator operation is called, it inserts the field value if it is not present in the ERecord and overrides it otherwise. To create new ERecords, MDE tools built atop the EMF API can use the standard factory design pattern.
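The upsert semantics of the mutators can be sketched as follows; this is an illustrative Python stand-in for ERecord's attribute fields only (the real class is an EMF/Java implementation, and set_attribute is an invented name):

```python
# Sketch of ERecord's dynamically typed accessor/mutator interface with
# upsert semantics (illustrative Python; the real API is an EMF class).

class ERecord:
    def __init__(self):
        self._attrs = {}

    def get(self, field_name):
        """Accessor: return the field's value as a plain object, or None
        when the field is absent (no metamodel to consult for validity)."""
        return self._attrs.get(field_name)

    def set_attribute(self, field_name, value):
        """Mutator with upsert semantics: insert the field if absent,
        override its value if already present."""
        self._attrs[field_name] = value

r = ERecord()
r.set_attribute("distance", "2.5 miles")   # insert: the field was absent
r.set_attribute("distance", "4.0 km")      # upsert: overrides the old value
```

Because there is no metamodel, get cannot distinguish an unknown field from a deliberately empty one; returning None for both is the price paid for flexibility.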
In the evaluation section, we have focused on the utilization of flexible models as input for
MTs, exploring both the advantages and disadvantages of this approach.
3.2. UntypedModel class
The UntypedModel class serves as an intermediary abstraction layer between databinding libraries
and model management APIs, specifically the EMF API in this study. This layer offers several
design benefits:
• First, it decouples the data sources from the model management API, enabling the
extraction of semi-structured data from various heterogeneous sources in a unified way.
This means that third-party object mappers can be used to produce lists of records from
CSV, JSON, and XML serializations, among others.
• Second, it encapsulates design concerns that are commonly shared across various semi-
structured data sources. For example, the ability to treat a dataset as a plain or hierarchical
structure can be reused for different data formats, leading to a more modular and reusable
design.
• Third, applications built using flexible models experience consistent performance across
different data sources, as the flexible metamodel acts as a pivot language that hides
the implementation databinding details. This uniformity is achieved by abstracting the
specificities of each data format and providing a common interface, resulting in a more
efficient and maintainable design.
• Lastly, for EMF-based tools, the notion of ERecord in flexible models enables interaction
at an object-level without having to rely on extraneous third-party APIs.
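The decoupling benefit above can be illustrated with a sketch in which two databinding front ends, one for CSV and one for JSON, feed a single pivot representation. Here a plain list of dicts stands in for UntypedModel, and both function names are invented for the example:

```python
# Sketch of the pivot-language idea: heterogeneous serializations converge
# on one record structure (illustrative Python; not the actual import API).
import csv, io, json

def records_from_csv(text):
    # The csv databinding: each row becomes one record.
    return list(csv.DictReader(io.StringIO(text)))

def records_from_json(text):
    # The json databinding: each array element becomes one record.
    return json.loads(text)

CSV_SRC = "patient_id,distance\n1,2.5 miles\n2,2.0 km\n"
JSON_SRC = ('[{"patient_id": "1", "distance": "2.5 miles"},'
            ' {"patient_id": "2", "distance": "2.0 km"}]')

# Both serializations yield the same pivot structure, so downstream
# transformations are written once, against the records alone.
assert records_from_csv(CSV_SRC) == records_from_json(JSON_SRC)
```

Any transformation written over the record structure is thereby insulated from the serialization format, which is the uniformity the flexible metamodel provides for EMF-based tools.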
Flexible models can be imported using two main methods. The first method is from semi-
structured data that is represented as List