From a prototype to a data ecosystem for experimental data
and predictive models
Edoardo Ramalli1,*, Barbara Pernici1
1 Politecnico di Milano, Via Giuseppe Ponzio, 34/5, Milan, 20133, Italy


Abstract

Data ecosystems have been a game-changer in many industrial applications and research fields, speeding up their development. The possibility of collecting large amounts of data within the same environment has also raised some questions common to all application domains, including the quality of the collected data and their reliability and trustworthiness. Drawing on experience gained in collaboration with the chemical engineering field, this paper raises some discussion points related to the management of experimental data and predictive models within a data ecosystem. In fact, this type of data poses new requirements that call for specific treatment before being implemented in a traditional data ecosystem.


Proc. of the First International Workshop on Data Ecosystems (DEco'22), September 5, 2022, Sydney, Australia
edoardo.ramalli@polimi.it (E. Ramalli); barbara.pernici@polimi.it (B. Pernici)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


1. Introduction

Data ecosystems (DEs) have shown, in recent years, their potential in boosting research and industry, holding a central role in many definitions of Industry 4.0 [1]. DEs facilitate and encourage data sharing while extracting knowledge and enhancing the comprehension of a phenomenon [2]. In some cases, the data management features of a DE are even fundamental and a prerequisite to applying data science in a big data context [3]. In addition, DEs lend themselves well to the ongoing scholarly trend of data reuse [4]. In any case, DEs raise many challenges that need to be addressed and tailored based on the domain [5].

A possible application of such an information system is to use it as a collection of tools, scientific repositories, and services to improve the development process of predictive models for physical-chemical phenomena. The development of these data-driven models relies on a manually managed data set. A model computes simulated data (or simulations) that are then manually validated against the corresponding experimental data (or experiments). A DE in this field represents a possible game-changer for several reasons.

First, the amount of available experimental data is tiny compared to other data-intensive applications, even if it has been growing in recent years. Experiments are expensive and time-consuming, while running simulations is computationally heavy. Therefore, sharing and reusing data is a primary objective of the scientific community and one of the principal purposes of employing a data ecosystem in this domain. As in many data-driven applications, "you are what you eat," and concepts such as data quality [6] or database diversity tools [7] are fundamental to building reliable predictive models. Data quality has been proven to have a direct impact on decision-making activities [8], while database diversity could also have relevant social implications in some domains due to the bias present in the dataset [9]. DEs are protagonists also in other aspects: making data and services converge in the same system can help increase their use and trustworthiness. The more data are collected inside a DE, the more users are attracted, who themselves bring more data. The more active the users are in the DE, the more the data and services are checked and used, and the more reliable the data and the overall system become. Therefore, having data and tools in the same system creates a virtuous circle, even if starting it can be very challenging.

This work presents the experience in designing and implementing a data ecosystem to enhance the development process of predictive models in the field of chemical engineering. This DE needs to manage predictive models, analysis results, and experimental and simulated data to extract insights automatically, while trying to address the typical challenges a data ecosystem faces during its design, such as transparency and trustworthiness [10].

The need for DEs for storing experimental data and tools in the chemical engineering domain has emerged in the last few years. First attempts to integrate data together with analysis tools were made over time in the PriMe repository [11], where some tools were provided in addition to data, and where the need for being able to analyze the data production process and the quality of data first emerged. Other repositories storing both experimental data and tools include systems such as ChemKED (http://www.chemked.com/) or ReSpeCTh [12]. However, there is a lack of support for an approach to the design of simulation models as a process involving all the phases, from experimental data collection to simulation results analysis.
This limitation has led either to the abandonment or to the creation of many alternative frameworks or software packages focusing on specific aspects (e.g., CloudFlame, https://cloudflame.kaust.edu.sa/, for flame data and simulations) that are challenging to make work together.

This paper discusses the emerging directions derived from the design and use of a prototype system for such purposes. Even if the features of a DE are well defined, implementing and tailoring them in a particular domain and application has its unique challenges. For instance, scientific repositories have well-known problems with data quality [13]. The biggest challenge concerns the design method for our data ecosystem. A top-down strategy requires much time in the design phase, and often consumers are not willing to wait, even if it is the best approach to saving the time spent readjusting or adding new features. On the other side, a bottom-up approach allowed us to deliver a product faster, even if several feedback-adjustment iterations were required. Nevertheless, this procedure highlighted some requirements that would hardly have emerged with a top-down approach, given the complexity of the application domain.

In any case, four phases were primarily identified during this project, as shown in Figure 1. In each phase, even if some features were not immediately needed in the current product delivery, some design decisions were made keeping in mind the final goal of delivering a data ecosystem. Therefore, this paper presents the challenges and design decisions in each phase toward developing a data ecosystem for a specific application domain.

Figure 1: The four stages of our project in the development of a DE in the chemical engineering field (Start Project → Prototype → Framework → Data Ecosystem).

After the kick-off of the project and the requirement collection, the first prototype [14] was delivered, whose main characteristics are the creation of a repository, with a proper database schema to collect the data, and the architectural structure of the various components of the system, together with the technological choices.

The second phase regards the framework creation [15]. If the primary purpose of the prototype phase is to collect feedback from the end users, with the framework the need was to deliver a product that can be used daily by a single research group. This requirement implicitly suggests that a series of features are needed to ensure good data quality in the database, fault tolerance, usability, accessibility, authentication, interoperability, and so on.

Finally, in the last stage of the project, the assumption that only a small number of people, all belonging to the same research group, would use the framework was removed, de facto transforming our framework into a data ecosystem.

The paper is structured as follows. In Section 2, the prototype stage of our process is introduced, also presenting the main types of data that will be stored in the DE. Section 3 illustrates the framework version of the project, where design and implementation choices are made to fulfill the typical characteristics of a DE. Section 4 shows the challenges and consequences of implementing a DE considering intellectual property data in a collaborative environment. Finally, the data ecosystem's open challenges and future developments are discussed in Section 5.


2. Prototype

In the first phase of the project, the requirements were gathered and discussed continuously with the domain experts (our stakeholders). At the end of the requirements collection phase, it is essential to properly design the architecture and the technology necessary to implement an information system suitable to meet the discussed needs. The resulting product of this phase is a simple prototype used to check whether the initial requirements are fulfilled and to collect new ones. However, it is already necessary to structure the system to be compatible with the final architecture of a data ecosystem, even if some of these features are not strictly necessary at this step. A detailed description of the design decisions in this phase is reported in [14].

In a DE for the development of predictive models, it is a game-changer to gather experimental data, models, and analysis tools in the same system. These entities define what types of data the final DE should manage: experimental data (experiments), simulated data (simulations), models, and, eventually, analysis results.

From an architectural and implementation perspective, to guarantee maintainability and extensibility over time, it is preferable to choose a micro-service architecture that provides a few simple services, together with a relational database to store experiments, models, simulations, and analyses. The user can then request and combine them as preferred through an HTTP API, hence separating the front-end from the back-end; a minimal sketch of such a service is given below.

Experimental Data. Experiments are actual experimental measurements about the investigation of a particular environmental condition. An experiment is, in fact, correlated with other metadata that characterize, for example, the experiment author, the methodology, and the experimental conditions. These metadata contain a series of information essential to classify the experiments correctly.
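To make the shape of the system concrete, the following minimal sketch shows how such a micro-service could expose experiment records, with their classification metadata, through an HTTP API. Flask is assumed; the route, the field names, and the example records are hypothetical illustrations, not the actual schema of the framework.

```python
# Minimal sketch of one query micro-service (assuming Flask); the route,
# fields, and records below are hypothetical, not the system's real schema.
from flask import Flask, jsonify, request

app = Flask(__name__)

# In the real system these rows live in the relational database; an
# in-memory list keeps the sketch self-contained and runnable.
EXPERIMENTS = [
    {"id": 1, "author": "Smith", "year": 2019, "journal": "Combust. Flame",
     "equipment": "shock tube", "condition": "10 bar"},
    {"id": 2, "author": "Rossi", "year": 2021, "journal": "Fuel",
     "equipment": "rapid compression machine", "condition": "20 bar"},
]

@app.route("/experiments")
def list_experiments():
    """Filter experiments by any metadata field, e.g. /experiments?author=Smith."""
    result = EXPERIMENTS
    for key, value in request.args.items():
        result = [e for e in result if str(e.get(key)) == value]
    return jsonify(result)

if __name__ == "__main__":
    app.run(port=5000)
```

Because the front-end only sees HTTP endpoints like this one, it can be developed and replaced independently of the back-end storage.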
In fact, while in this area there is a progressive propensity for sharing and a greater availability of data, it nevertheless remains a sector in which the order of magnitude of existing data is much lower than in other areas such as, for instance, that of social media; therefore, it is essential to collect and correctly catalog the experiments in order to encourage their reuse. The domain experts defined which metadata are mandatory and which ones are optional, and the relational database schema was designed accordingly. An abstract representation of the experimental data metamodel is provided in Figure 2. The analysis tools will leverage these metadata to understand the predictive behavior of the model in specific conditions.

Figure 2: Experimental data metamodel (the Experiment entity is linked to Journal, Author, Equipment, Year, Condition, Paper, Subject, and other metadata).

In our scenario, the primary source of experiments is journal papers. Inside a paper there are usually multiple plots (an example is in Figure 3) or data tables about the measurements, where the corresponding metadata are not uniquely tabulated but are described as narrative in the text. Recently, the tendency is to share the numerical data of the experiments in the supplementary material associated with papers, facilitating the data collection. In some cases, a representation of the experimental data and metadata is already available in a format commonly used in the domain, such as the XML ReSpeCTh format adopted in [12], and it is available with an associated DOI. Metadata for the published papers are extracted from Scopus, retrieving citation data using the search APIs (https://dev.elsevier.com/).

Figure 3: Example of experimental and corresponding simulated data. Simulated data are theoretically a continuous set of points that the predictive model can compute. In practice, simulating a data point could be very expensive.

Model. Predictive models are treated as black boxes that, if provided to a numerical solver, can predict a particular domain setting. Thanks to the increasing availability of data and computational resources, the number of developed models has increased in the last few years. Nevertheless, not all the models can predict all the environmental conditions of a domain. Therefore, the metadata associated with (but not only with) a predictive model are fundamental to study its behavior. The reasons behind the different capabilities of the predictive models vary but are mainly due to computational expensiveness: what is known as a "detailed model" is a complete model that can predict the behavior of a domain in many different conditions, but it takes a long time to execute since it has to solve many differential equations. For this reason, simplified models are derived from the detailed ones at the cost of shrinking the prediction accuracy and reducing the capability to operate on and predict all the possible conditions of the domain. As in the case of the experiments, the domain experts define the mandatory metadata for a model (Figure 4).

Figure 4: Metamodel of the 'model' data (the Model entity is linked to Author, Version, Condition, and other metadata).

Simulated Data. Simulations connect experiments to models. Given a model and a numerical solver, it is possible to simulate an experiment by specifying the experimental condition to the solver, thus generating the corresponding simulated data. These generated data are fundamental to performing different types of analysis on the experiments and on the model. For example, model validation is one of the most critical phases in the model development process. In this procedure, the model performance is evaluated by comparing the similarity of the experimental data with the corresponding simulated data, as in Figure 3, generating one possible type of analysis data.
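To illustrate the comparison that model validation relies on, the sketch below interpolates a simulated curve at the experimental points and computes a simple mean relative error. The actual similarity measure used by the framework is not specified here, so this score is only an illustrative stand-in, and the numbers are made up.

```python
# Illustrative sketch of the experiment-vs-simulation comparison behind
# model validation; a mean relative error over interpolated points stands
# in for the framework's actual (unspecified) similarity index.
from bisect import bisect_left

def interpolate(xs, ys, x):
    """Linearly interpolate the simulated curve (xs, ys) at point x."""
    i = bisect_left(xs, x)
    if i == 0:
        return ys[0]
    if i == len(xs):
        return ys[-1]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def mean_relative_error(experiment, simulation):
    """Average |simulated - measured| / |measured| over experimental points."""
    sim_x, sim_y = zip(*sorted(simulation))
    errors = []
    for x, measured in experiment:
        predicted = interpolate(sim_x, sim_y, x)
        errors.append(abs(predicted - measured) / abs(measured))
    return sum(errors) / len(errors)

# Hypothetical example: sparse experimental points vs. a denser simulated
# curve, as in the situation depicted in Figure 3.
exp = [(800, 0.12), (900, 0.34), (1000, 0.81)]
sim = [(750, 0.08), (850, 0.20), (950, 0.55), (1050, 0.95)]
print(f"mean relative error: {mean_relative_error(exp, sim):.3f}")
```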
3. Framework

Until now, the prototype was a proof of concept of what could be achieved; once it was delivered, new requirements and discussions arose from the final users. In addition, with the switch to the framework version, new challenges related to day-by-day use needed to be properly addressed.

First, the framework should manage and automate the entire life cycle of the data correctly, from their insertion to their exchange, with all the associated implications such as data errors and different representation formats of the data. Second, it is critical to integrate analysis tools to extract knowledge from the data. As before, the design and implementation of the new features have to be done keeping in mind that the final goal is to create a data ecosystem for experiments and predictive models. A detailed description of the framework is provided in [15]. This section focuses on the most important aspects, related to the development of a data ecosystem as the final goal of the project, that emerged during this phase.

3.1. Data integration and exchange

In some domains, experimental and simulated data can be expensive to generate or replicate. As a result, the data accumulate over decades (in our domain, some of them date from the late 1940s), witnessing an evolution of the representation formats over the years. Even in recent years, with the digitalization of the data, commonly agreed representation formats can be challenging to develop, since it is rare to witness a perfect agreement within the scientific community about what is mandatory to represent.

Interoperability is a fundamental prerequisite for a data ecosystem, and for this reason the strategy that reconciles the use of many representation formats, thus collecting as much data as possible, is to employ translation engines. Since the possible formats are few in our case study and there is no prevalent representation format, this strategy was the best trade-off. All data inside the framework are stored only in the relational database, following the schema defined by the experts, without being bound to particular formats. In order to feed and collect data from the framework, we need translation engines for every required representation format. Similarly, each numerical solver accepts a configuration file for each simulation and produces an output file in a specific format. Also in this case, the use of translation engines allows the framework to be independent of the representation format of the data.
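A translation engine of this kind can be pictured as a pair of functions mapping the internal record to an exchange format and back. The sketch below uses illustrative tag names, not the actual ReSpeCTh schema or the framework's internal one.

```python
# Minimal sketch of a translation engine: the internal relational record is
# rendered into an external XML format and parsed back. The tag names are
# illustrative only, not the real ReSpeCTh schema.
import xml.etree.ElementTree as ET

def to_xml(experiment: dict) -> str:
    """Serialize an internal experiment record into an exchange format."""
    root = ET.Element("experiment")
    for key, value in experiment.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

def from_xml(payload: str) -> dict:
    """Parse an exchange-format document back into an internal record."""
    root = ET.fromstring(payload)
    return {child.tag: child.text for child in root}

record = {"author": "Smith", "year": 2020, "property": "pressure", "unit": "bar"}
assert from_xml(to_xml(record))["unit"] == "bar"
```

One such pair of functions per supported format keeps the relational schema as the single internal representation, exactly the independence property described above.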
3.2. Data management

Our data ecosystem has been designed to gather models, experiments, and simulations in the same system. Thanks to this structure, as shown in Figure 5, the framework acts as a man-in-the-middle that manages and shares the knowledge between the four entities to generate new knowledge.

The downside of this conceptual architecture is that the entities are strongly connected, and incorrect data could quickly impact the others. For this reason, the concept of ownership of data is introduced inside the framework to contain this hazard. In this way, once an error is spotted, it is possible to quickly identify all the erroneous data involved. Services are provided in the framework for the analysis of data quality and for comparing the results of simulations with experimental data, as described in the following sections. In addition, data management operations on experimental data are provided to improve the quality of the experimental basis stored in the repository. The concept of ownership will be particularly helpful in the design of the roles in a data ecosystem, as described in Section 4, and therefore in regulating access to data.

3.3. Data quality

Nowadays, predictive models are increasingly data-driven, even in domains where a description of the phenomena with physical laws is available. For this reason, data quality plays a more and more central role in the model development process, since it directly impacts the prediction quality. In addition, ensuring certain data quality levels within the DE enhances the system's trustworthiness, thus starting a loop in which the number of users increases as a consequence of the increased amount of collected data and vice versa.

In our domain, following the concept of fitness for use [16], three quality dimensions have been identified: completeness, consistency, and accuracy.
Timeliness is not of interest in the context of experiments and simulations, even if it is often used as a quality metric, mainly for two reasons. First, even if older experiments were carried out with older and less precise instruments, they still represent a valuable source of information, and their imprecision should be included in their uncertainty evaluation, which "just" needs to be handled correctly. Second, since the experiments are expensive and hence rare, it is pretty unlikely that multiple experiments are carried out in exactly the same conditions, thus "updating" the old values. For a similar reason, since the predictive models are deterministic, the simulated data do not change over time if forecast with the same model and numerical configuration of the solver.

In the framework, the data quality control process is composed of two parts, one automatic and the other manual, where the automatic control is performed right after the insertion of new data in the repository and not, for example, a posteriori based on a recurrent schedule. Data that do not reach the minimum data quality requirements are immediately rejected.

As in all data quality applications, the rules to measure the data quality dimensions depend on the domain and, often, they are also implementable as automatic checks. Regarding completeness, thanks to the domain knowledge provided by the experts, it is possible to know which metadata are mandatory or optional and under which conditions. For example, it is usually compulsory to express the unit of measurement and the name of the measured property, but for some properties' values the unit is not expressed since they are adimensional. Consistency works in a similar way: it is checked that properties of the same instance are consistent with each other. A typical example is the accordance between the property name, like "pressure," and a plausible unit of measurement such as "bar" or "pascal." Finally, the accuracy of the data is considered. It is well known that estimating accuracy is by far the most challenging data quality dimension, but a framework where experiments and models are combined has a non-negligible advantage.

Figure 5: The data ecosystem, with its tools, acts as a man-in-the-middle between the four types of data (experiments, models, simulations, and analyses).

In Figure 5, the typical relation between the experiments and the model is shown: during the model validation procedure, the experiments are used to quantify the predictive model performance. However, since the model is obviously not perfect, it has an (epistemic) uncertainty, but it is reliable enough in many different conditions that it can be used to check whether most of the information inside an experiment is meaningful. In fact, both the accuracy of the numerical data and the metadata can be tested. If the measurements differ significantly from the simulated ones, this discrepancy suggests an error in the reported measurements or in the metadata used to set up the simulation. In other words, the model can validate the experiments. This approach foresees the cross-validation of an experiment against multiple simulated data about the same experimental condition but using different models. In such a way, it is possible to create a de-facto ground truth against which one can assess whether an experiment is plausible or wrong. However, this check is not always conclusive: if the experimental data are very different from the simulated data, this is only a hint of a possible error but not a certainty. Therefore, this automatic approach is combined with a manual validation of the experimental data by an expert.
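Returning to the automatic checks performed at insertion time, a minimal sketch of the completeness and consistency rules is given below. The mandatory fields and the property/unit pairs are illustrative stand-ins for the rules actually defined by the domain experts.

```python
# Sketch of the automatic part of the data quality control described above,
# run right after insertion. MANDATORY and PLAUSIBLE_UNITS are hypothetical
# stand-ins for the expert-defined rules.
MANDATORY = {"author", "year", "property", "unit"}
PLAUSIBLE_UNITS = {
    "pressure": {"bar", "pascal", "atm"},
    "temperature": {"K", "C"},
}
ADIMENSIONAL = {"equivalence ratio"}  # example of a unitless property

def completeness(record: dict) -> list:
    """Mandatory metadata must be present (adimensional values excused)."""
    missing = MANDATORY - record.keys()
    if record.get("property") in ADIMENSIONAL:
        missing.discard("unit")
    return [f"missing field: {f}" for f in sorted(missing)]

def consistency(record: dict) -> list:
    """Property name and unit of measurement must agree (e.g. pressure/bar)."""
    prop, unit = record.get("property"), record.get("unit")
    if prop in PLAUSIBLE_UNITS and unit not in PLAUSIBLE_UNITS[prop]:
        return [f"unit '{unit}' implausible for property '{prop}'"]
    return []

def admit(record: dict) -> bool:
    """Reject the record immediately if any automatic check fails."""
    problems = completeness(record) + consistency(record)
    for p in problems:
        print("rejected:", p)
    return not problems

admit({"author": "Smith", "year": 2020, "property": "pressure", "unit": "meter"})
```

Accuracy, the third dimension, is the one handled by the model-based cross-validation and the expert's manual check described above, since it cannot be reduced to simple rules like these.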
3.4. Data FAIRness

FAIR (Findable, Accessible, Interoperable, and Reusable) data [17] have been shown to bring many benefits to data ecosystems. In this section, following the recommendations from the literature [18], appropriate functionalities are presented for each principle of the FAIR policies that have been implemented or designed for the experimental data inside our data ecosystem.

Findable. Experiments are stored and used inside the data ecosystem through a relational database, which is very flexible and easily maintainable compared to a file-based organization of the experimental data. Nevertheless, a database representation of the experiments is not findable, and for this reason, for each experiment we create an XML representation following an XML schema that is widely accepted in the scientific community of the experiments' domain. The file is then automatically uploaded to Zenodo to assign it a unique global identifier, together with other metadata that make the experiment searchable without necessarily using our data ecosystem; a sketch of this publishing step is given at the end of this section.

Accessible. Experiments inside our data ecosystem are identified both with a (numerical) primary key and with the associated DOI. A primary numerical key makes implementing the relational instances in the database easier even before the DOI has been generated. Our data ecosystem offers data management services through an HTTP API, accepting typical request formats such as CSV, JSON, and XML. One of the advantages of such an HTTP API micro-service structure is that the final users are not required to use particular software, a programming language, or technical expertise to access data and services, and they can combine them as preferred. Authentication is required to use the API upon a free sign-up procedure. Authentication enables traceability and accountability of the operations and helps keep the quality level of the scientific repository higher with respect to an open-access configuration.

Interoperable. Experiments in their XML representation format are a plug-and-play solution. Every researcher can use them as preferred, paying attention to the definition of each XML tag. If the experiments are accessed through the HTTP API, the same vocabulary of the XML representation format is used to query the database and in the responses.

Reusable. One of the primary purposes of the data ecosystem is to reuse data, encourage their sharing among institutions, and avoid duplicates. Experimental data can be uniquely cataloged through some metadata. Developing the database around the uniqueness constraint of these metadata allows us to maximize reuse.
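The publishing step described under Findable can be sketched as follows, assuming Zenodo's deposition REST API; the token, file name, and metadata values are placeholders, and error handling is omitted.

```python
# Hedged sketch of the publisher step: pushing an experiment's XML file to
# Zenodo to obtain a DOI. Assumes Zenodo's deposition REST API; the token
# and metadata values below are placeholders.
import requests

ZENODO = "https://zenodo.org/api/deposit/depositions"
TOKEN = "..."  # personal access token (placeholder)

def publish(xml_path: str, title: str) -> str:
    params = {"access_token": TOKEN}
    # 1. Create an empty deposition.
    dep = requests.post(ZENODO, params=params, json={}).json()
    # 2. Upload the XML representation of the experiment.
    with open(xml_path, "rb") as fh:
        requests.put(f"{dep['links']['bucket']}/{xml_path}",
                     data=fh, params=params)
    # 3. Attach minimal metadata that makes the record searchable.
    meta = {"metadata": {"title": title, "upload_type": "dataset",
                         "description": title,
                         "creators": [{"name": "Doe, Jane"}]}}
    requests.put(f"{ZENODO}/{dep['id']}", params=params, json=meta)
    # 4. Publish: Zenodo mints and returns the DOI.
    pub = requests.post(f"{ZENODO}/{dep['id']}/actions/publish",
                        params=params).json()
    return pub["doi"]
```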
3.5. Data generation and analysis

Thanks to the model, we can theoretically generate an infinite amount of simulated data and, similarly, using the analysis tools and combining them as we prefer, we can create a vast amount of analysis data. Neglecting the space needed to store such quantities of data, the first limitation that makes this idea unfeasible is the amount of computational resources needed to generate them. A centralized architecture where all the computational burden is on a single organization is not sustainable, and even if the cost is shared, the bureaucracy behind sharing computational resources is very complicated. The solution to this problem is a coordinator-worker architecture where the framework, i.e., the coordinator, collects the jobs and distributes them among the workers, which in some cases can delegate the jobs to other machines, as shown in Figure 6. The coordinator-worker configuration is scalable and allows each user to decide how many computational resources to dedicate and to use them only for their own jobs; a sketch of the worker side of this protocol is given at the end of this section.

Figure 6: Coordinator-worker architecture (1. Create Job; 2. Ask for a Job; 3. Get Job-related Data; 4. Execute Job; 5. Collect Job Results; 6. Forward Job Results).

Providing analysis tools inside the framework is a game-changer. The user is incentivized to stay in the system and leverage the other knowledge available in terms of data and tools. The more the users stay in the system, the more they are inclined to share data, thus improving the overall system and starting a virtuous circle. Such tools generate new knowledge about the data or the domain and increase the awareness of and insight into the data. The concept of database coverage or diversity, for example, is central here. In all data-driven models, "you are what you eat," and therefore if a model is generated only using data that represent a restricted portion of a domain, the model will be able to more or less correctly predict only what it has already seen. The drawbacks of such an approach could lead to ethical problems, since classification and regression models could have strong biases based on the diversity and the balance of the data used to generate them. A predictive model for physical domains suffers from the same hazard, with data mainly used in the validation phase: if the model is validated against a large amount of data that is not diverse, the predictive model performance could look astonishing, but in practice it could be much worse.
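The worker side of the protocol in Figure 6 (steps 2 to 6) can be sketched as a simple polling loop. The coordinator endpoints and the solver invocation below are hypothetical.

```python
# Sketch of the worker side of the coordinator-worker protocol of Figure 6;
# the endpoint paths and the solver call are hypothetical placeholders.
import time
import requests

COORDINATOR = "https://ecosystem.example.org/api"  # hypothetical

def run_solver(model, conditions):
    """Placeholder for the numerical solver invocation (step 4)."""
    return {"status": "done", "points": []}

def worker_loop():
    while True:
        # Step 2: ask the coordinator for a pending job.
        job = requests.get(f"{COORDINATOR}/jobs/next").json()
        if not job:
            time.sleep(60)  # no pending work, poll again later
            continue
        # Step 3: fetch the model and experimental conditions the job needs.
        data = requests.get(f"{COORDINATOR}/jobs/{job['id']}/data").json()
        # Step 4: execute the job on local computational resources.
        result = run_solver(data["model"], data["conditions"])
        # Steps 5-6: collect the results and forward them to the coordinator.
        requests.post(f"{COORDINATOR}/jobs/{job['id']}/results", json=result)
```

Because each worker pulls jobs instead of being pushed to, an organization can contribute exactly as many machines as it wishes, which is what makes the configuration scalable.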
4. Data ecosystem

The final stage of this evolution regards the transition from a framework to a data ecosystem. In this last evolution stage, what is important to investigate is how the framework that has been actively used by one research group should evolve to host multiple organizations and many more users. This transition, which seems straightforward, in practice poses mainly two different challenges, which can be addressed smoothly thanks to the design choices of the previous project steps. First, the activities for repository management described before, such as experiment validation, need to be formalized in terms of responsibility and accountability. Second, the data ecosystem could host data with intellectual property (IP) constraints that are not yet open access but are on the data ecosystem because the final user wants to take advantage of our functionalities and analysis tools, for example, to compare the quality of data. Both these challenges have in common that it is necessary to define user roles and rules with corresponding permissions over the data ecosystem functionalities.

In this scenario, it is assumed that the data ecosystem is trustworthy in terms of privacy and security and that no specific entity owns it, but that it belongs to the community.

4.1. Roles

Several organizations collaborate within a data ecosystem. An organization is an abstract concept that groups several people. Sometimes it is possible to map this concept to other familiar entities such as a university, a research center, a department, or a research group. Each user belongs to at least one organization to be part of our data ecosystem and has at least one role. The (virtual) ownership of the data belongs to the organizations. Data entered or generated by a user will be owned by the organization to which the user belongs, while the paternity of the data remains with him/her. The users must specify whether the data deriving from them are open or closed content. Each user can access all the open-content data of all organizations inside the data ecosystem, and all the closed-content data belonging to their organization(s); a sketch of these access rules closes this section.

The configuration in organizations allows an easy sharing of closed-content resources among them with different levels of granularity and relationship: a single experiment or a group of them could be shared with another organization, or an organization can share the whole closed-content data in one direction or in both directions.

The data ecosystem holds the role of publisher: as soon as a content item is made open, the DE generates, in the case of experiments, an XML representation file that is published on Zenodo (https://zenodo.org/) to associate a DOI with it and enhance accessibility and findability.

Besides the publisher role, in our scenario five user roles were identified, as follows. Figure 7 shows the five roles involved in four typical actions in the overall workflow for the model development process. The actions represented are the generation of experiments or models and their insertion into the DE, and the collection of data, such as analyses, experiments, simulations, and models, together with the creation of simulation and analysis jobs.

Figure 7: The main five roles inside our DE (experimentalist, researcher, writer, reader, executor) and their main four activities (generation and insertion of experiments/models with a data quality check, policy-checked collection from the repository, querying, and execution of simulations and analyses), distinguishing open content, closed content, and authorized access.

Experimentalist. This role identifies a scientist who carries out the experiment and generates the experimental data. The experimentalist has the intellectual property of the data. Based on the situation, the experimentalist can decide to immediately publish the results in a journal (or similar) or to provide the data directly to other entities through a private communication and publish them later. According to this choice, the experiments have an open or closed content policy, respectively. Even if a journal is not open access or requires a subscription, its experiments are considered open content because they are publicly available material.

Researcher. The researcher has mainly two functions in our DE and scenario. First, they generate the predictive model and, as in the case of the experimentalist, they have the faculty to choose the publication policy. Second, they have the duty to verify the experiments in their validation procedure, as described before. Suppose the experiment that has to be validated is open content. In that case, a cross-validation strategy is preferred: a researcher from an organization different from the one owning the experiment will perform the task, to avoid possible bias and enhance the DE's overall trustworthiness. It is assumed that there is at least one researcher per organization.
Reader. The reader represents the user that has permission to access the open contents and all the closed contents belonging to their organization. Thanks to the authentication, it is possible to transparently hide part of the experiments, models, simulations, and analyses without changing the API.

Writer. The writer is a trained user that has the task of inserting into the DE all the collected data. It is a trained user because, in this field, insertion is not a straightforward operation and requires basic domain knowledge, even if the system and the researcher will check the data's validity later. The writers mainly insert experiments and models. They can find these data in the literature, or the data can be provided through private communication. In any case, it is their responsibility to associate the correct content policy with the objects.

Executor. This role represents a kind of user that has the privilege to allocate resources and generate new data in terms of simulations and analyses. In both cases, the executor needs to have access to both experiments and models to create a new simulation or perform analyses (as in the case when experiments need to be compared against simulations). These operations can be expensive. Also in this case, domain experience is required, for example, to set the optimal numerical configuration to solve a simulation and thus use the computational and storage resources wisely.

It is worth mentioning that even if an experiment is closed content and the user does not have the permissions, its metadata, i.e., in this domain, the experimental condition, are in any case open, and therefore it is possible to simulate this configuration. Nevertheless, all the analysis operations comparing the simulated data against the (closed) experimental data will be hidden.
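The access rules above can be condensed into a few predicates, as in the following sketch. It is illustrative of the policy only, not of the actual implementation.

```python
# Illustrative sketch of the Section 4.1 access rules: open content is
# visible to every user, closed content only within the owning
# organization(s); for closed experiments the metadata (the experimental
# condition) stays readable, so simulations can still be configured.
from dataclasses import dataclass, field

@dataclass
class Item:
    owner_org: str
    open_content: bool

@dataclass
class User:
    organizations: set = field(default_factory=set)
    roles: set = field(default_factory=set)

def can_read_data(user: User, item: Item) -> bool:
    return item.open_content or item.owner_org in user.organizations

def can_read_metadata(user: User, item: Item) -> bool:
    return True  # experimental conditions are always open

def can_execute(user: User) -> bool:
    return "executor" in user.roles  # only executors allocate resources

u = User(organizations={"org-a"}, roles={"reader"})
print(can_read_data(u, Item(owner_org="org-b", open_content=False)))  # False
print(can_read_metadata(u, Item(owner_org="org-b", open_content=False)))  # True
```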


5. Discussion and Conclusion

In this paper, we presented our experience in developing a data ecosystem to improve the development process of chemical-physical predictive models. As often happens in practice, our design process for the data ecosystem was bottom-up rather than top-down, due to the necessity of delivering a usable product quickly. The development of the final system foresees three product-related phases: prototype, framework, and data ecosystem. In each step, some properties of the final data ecosystem are taken care of. This approach allowed us to incrementally add complexity to the final ecosystem's design and to deal more smoothly with new requirements arising from a non-typical application domain. In addition to the typical challenges, a chemical engineering data ecosystem has to deal with specific types of data, such as predictive models and experimental and simulated data, that require ad-hoc methodologies, for example in the case of data quality measurements or intellectual property management. Some of these aspects are distinctive of scientific repositories, while the three-phase approach and some challenges and solutions are more universal.

The prototype phase, in particular, is important to collect the requirements arising from a new and complex domain, with the final goal of discovering the main types of data that need to be stored and the necessary services. The result of this step is the database and system architecture. A micro-service structure is a convenient architecture since, during a bottom-up approach, it is very probable that new requirements will arise; implementing a new service then becomes a combination of the existing ones. The framework step addresses the challenges of transforming a proof of concept into a system used daily by a restricted number of users. Therefore, this system version accounts for data quality and management aspects, implements FAIR principles, and has to be scalable in terms of computational resources. The final evolution deals with distinguishing the user roles inside the DE and the data ownership. In such a way, higher trustworthiness and transparency of the system and of the data are guaranteed, while the intellectual property requests are fulfilled.

Future developments concern the improvement of the implementation of some FAIR principles, in particular findability and reusability. We plan to introduce new features to make experiments with a restricted access policy, due to their intellectual property, searchable from outside; exposing just the metadata could enhance both the findability and the reusability of the experiments. In addition, we plan to present a provenance data model to improve the reusability of the analyses and the models, following the W3C recommendations.


References

[1] M. Jarke, B. Otto, S. Ram, Data sovereignty and data space ecosystems, Business & Information Systems Engineering 61 (2019) 549–550.
[2] E. Curry, A. Sheth, Next-generation smart environments: From system of systems to data ecosystems, IEEE Intelligent Systems 33 (2018) 69–76.
[3] V. Stodden, The data science life cycle: a disciplined approach to advancing data science as a science, Communications of the ACM 63 (2020) 58–66.
[4] C. Tenopir, E. D. Dalton, S. Allard, M. Frame, I. Pjesivac, B. Birch, D. Pollock, K. Dorsett, Changes in data sharing and data reuse practices and perceptions among scientists worldwide, PloS ONE 10 (2015) e0134826.
[5] C. Cappiello, A. Gal, M. Jarke, J. Rehof, Data Ecosystems: Sovereign Data Exchange among Organizations (Dagstuhl Seminar 19391), Dagstuhl Reports 9 (2020) 66–134. URL: https://drops.dagstuhl.de/opus/volltexte/2020/11845. doi:10.4230/DagRep.9.9.66.
[6] F. Sidi, P. H. S. Panahy, L. S. Affendey, M. A. Jabar, H. Ibrahim, A. Mustapha, Data quality: A survey of data quality dimensions, in: 2012 International Conference on Information Retrieval & Knowledge Management, IEEE, 2012.
[7] E. Ramalli, B. Pernici, Know your experiments: interpreting categories of experimental data and their coverage, in: SeaData at VLDB 2021, CEUR Workshop Proceedings, 2021, pp. 27–33.
[8] I. N. Chengalur-Smith, D. P. Ballou, H. L. Pazer, The impact of data quality information on decision making: an exploratory analysis, IEEE Transactions on Knowledge and Data Engineering 11 (1999) 853–864.
[9] M. Drosou, H. Jagadish, E. Pitoura, J. Stoyanovich, Diversity in big data: A review, Big Data 5 (2017) 73–84.
[10] S. Geisler, M.-E. Vidal, C. Cappiello, B. F. Lóscio, A. Gal, M. Jarke, M. Lenzerini, P. Missier, B. Otto, E. Paja, B. Pernici, J. Rehof, Knowledge-driven data ecosystems: Towards data transparency, ACM Journal of Data and Information Quality 14 (2022) 1–12.
[11] M. Frenklach, Process informatics for combustion chemistry, in: 31st International Symposium on Combustion, Heidelberg, 2006.
[12] T. Varga, T. Turányi, E. Czinki, T. Furtenbacher, A. Császár, ReSpecTh: a joint reaction kinetics, spectroscopy, and thermochemistry information system, in: Proceedings of the 7th European Combustion Meeting, volume 30, Citeseer, 2015, pp. 1–5.
[13] C. Batini, M. Scannapieco, Data and Information Quality: Dimensions, Principles and Techniques, Springer, 2016.
[14] G. Scalia, M. Pelucchi, A. Stagni, A. Cuoci, T. Faravelli, B. Pernici, Towards a scientific data framework to support scientific model development, Data Science 2 (2019) 245–273.
[15] E. Ramalli, G. Scalia, B. Pernici, A. Stagni, A. Cuoci, T. Faravelli, Data ecosystems for scientific experiments: managing combustion experiments and simulation analyses in chemical engineering, Frontiers in Big Data (2021) 67.
[16] R. Y. Wang, D. M. Strong, Beyond accuracy: What data quality means to data consumers, Journal of Management Information Systems 12 (1996) 5–33.
[17] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, et al., The FAIR guiding principles for scientific data management and stewardship, Scientific Data 3 (2016) 1–9.
[18] H. Koers, D. Bangert, E. Hermans, R. van Horik, M. de Jong, M. Mokrane, Recommendations for services in a FAIR data ecosystem, Patterns 1 (2020) 100058.