From a prototype to a data ecosystem for experimental data and predictive models

Edoardo Ramalli, Barbara Pernici
Politecnico di Milano, Via Giuseppe Ponzio 34/5, 20133 Milan, Italy
edoardo.ramalli@polimi.it, barbara.pernici@polimi.it

Proc. of the First International Workshop on Data Ecosystems (DEco'22), September 5, 2022, Sydney, Australia.

Abstract
Data ecosystems have been a game-changer in many industrial applications and research fields, speeding up their development. The possibility of collecting large amounts of data within the same environment has also raised questions common to all application domains, including the quality of the collected data and their reliability and trustworthiness. Drawing on experience gained collaborating with the chemical engineering field, this paper raises some discussion points related to the management of experimental data and predictive models within a data ecosystem. This type of data poses new requirements that call for specific treatment before being accommodated in a traditional data ecosystem.

1. Introduction

Data ecosystems (DEs) have shown in recent years their potential to boost research and industry, holding a central role in many definitions of Industry 4.0 [1]. DEs facilitate and encourage data sharing while extracting knowledge and enhancing the comprehension of a phenomenon [2]. In some cases, the data management features of a DE are even fundamental and a prerequisite to applying data science in a big data context [3]. In addition, DEs lend themselves well to the ongoing scholarly trend of data reuse [4]. In any case, DEs raise many challenges that need to be addressed and tailored to the domain [5].

A possible application of such an information system is to use it as a collection of tools, scientific repositories, and services to improve the development process of predictive models for physical-chemical phenomena. The development of these data-driven models relies on a manually managed data set. A model computes simulated data (or simulations) that are then manually validated against the corresponding experimental data (or experiments). A DE in this field represents a possible game-changer for several reasons.

First, the amount of available experimental data is tiny compared to other data-intensive applications, even if it has been growing in recent years. Experiments are expensive and time-consuming, while running simulations is computationally heavy. Therefore, sharing and reusing data is a primary objective of the scientific community and one of the principal purposes of employing a data ecosystem in this domain. As in many data-driven applications, "you are what you eat," and concepts such as data quality [6] or database diversity tools [7] are fundamental to building reliable predictive models. Data quality has been proven to have a direct impact on decision-making activities [8], while database diversity can also have relevant social implications in some domains due to the bias present in the dataset [9]. DEs are protagonists also in other respects: making data and services converge in the same system can help increase their use and trustworthiness. The more data are collected inside a DE, the more users are attracted, who themselves bring more data. The more active the users are in the DE, the more the data and services are checked and used, and the more reliable the data and the overall system become. Therefore, having data and tools in the same system creates a virtuous circle, even if starting it can be very challenging.

This work presents our experience in designing and implementing a data ecosystem to enhance the development process of predictive models in the field of chemical engineering. This DE needs to manage predictive models, analysis results, and experimental and simulated data to extract insights automatically, while trying to address the typical challenges a data ecosystem faces during its design, such as transparency and trustworthiness [10].

The need for DEs for storing experimental data and tools in the chemical engineering domain has emerged in the last few years. First attempts to integrate data together with analysis tools were made over time in the PriMe repository [11], where some tools were provided in addition to data, and where the need to analyze the data production process and the quality of the data first emerged. Other repositories storing both experimental data and tools include systems such as ChemKED (http://www.chemked.com/) or ReSpeCTh [12]. However, there is a lack of support for an approach that treats the design of simulation models as a process involving all the phases, from experimental data collection to simulation results analysis. This limitation has led either to the abandonment or to the creation of many alternative frameworks or software tools focusing on specific aspects (e.g., CloudFlame, https://cloudflame.kaust.edu.sa/, for flame data and simulations) that are challenging to make work together.

This paper discusses the emerging directions derived from the design and use of a prototype system for such purposes. Even if the features of a DE are well defined, implementing and tailoring them to a particular domain and application has its unique challenges. For instance, scientific repositories have well-known problems with data quality [13]. The biggest challenge concerns the design method for our data ecosystem. A top-down strategy requires much time in the design phase, and often consumers are not willing to wait, even if it is the best approach to save time readjusting or adding new features later. On the other side, a bottom-up approach allowed us to deliver a product faster, even if several feedback-adjustment iterations were required.
Nevertheless, this procedure highlighted some requirements that would hardly have emerged with a top-down approach, given the complexity of the application domain.

In any case, four phases were identified during this project, as shown in Figure 1. In each phase, even if some features were not immediately needed for the current product delivery, some design decisions were made keeping in mind the final goal of delivering a data ecosystem. Therefore, this paper presents the challenges and design decisions of each phase toward developing a data ecosystem for a specific application domain.

Figure 1: The four stages of our project in the development of a DE in the chemical engineering field: project start, prototype, framework, and data ecosystem.

After the kick-off of the project and the requirement collection, the first prototype [14] was delivered; its main characteristics are the creation of a repository, with a proper database schema to collect the data, and the architectural structure of the various components of the system, together with the technological choices.

The second phase regards the framework creation [15]. If the primary purpose of the prototype phase is to collect feedback from the end users, with the framework the need was to deliver a product that can be used daily by a single research group. This requirement implicitly calls for a series of features needed to ensure good data quality in the database, fault tolerance, usability, accessibility, authentication, interoperability, and so on.

Finally, in the last stage of the project, the constraint that only a small number of people, all belonging to the same research group, would use the framework was removed, de facto transforming our framework into a data ecosystem.

The paper is structured as follows. Section 2 introduces the prototype stage of our process, also presenting the main types of data that will be stored in the DE. Section 3 illustrates the framework version of the project, where design and implementation choices are made to fulfill the typical characteristics of a DE. Section 4 shows the challenges and consequences of implementing a DE considering intellectual property data in a collaborative environment. Finally, the data ecosystem's open challenges and future developments are discussed in Section 5.

2. Prototype

In the first phase of the project, the requirements were gathered and discussed continuously with the domain experts (our stakeholders). At the end of the requirements collection phase, it is essential to properly design the architecture and the technology necessary to implement an information system suitable to meet the discussed needs. The resulting product of this phase is a simple prototype used to check whether the initial requirements are fulfilled and to collect new ones. However, it is already necessary to structure the system to be compatible with the final architecture of a data ecosystem, even if some of these features are not strictly necessary at this step. A detailed description of the design decisions of this phase is reported in [14].

In a DE for the development of predictive models, it is a game-changer to gather experimental data, models, and analysis tools in the same system. These entities define what type of data the final DE should manage: experimental data (experiments), simulated data (simulations), models, and, eventually, analysis results. From an architectural and implementation perspective, to guarantee maintainability and extensibility over time, it is preferable to choose a micro-service architecture that provides a few simple services, together with a relational database to store experiments, models, simulations, and analyses. The user can then request and combine them as preferred through an HTTP API, hence separating the front end from the back end.
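As an illustration of this architectural choice, the following sketch shows what one such micro-service endpoint could look like. It is a minimal example under stated assumptions, not the actual implementation: the routes, the table layout, and the use of Flask and SQLite are placeholders chosen only for the example.

```python
# Minimal sketch of a prototype micro-service: a relational store behind a
# small HTTP API. Routes, fields, and the storage backend are assumptions.
import sqlite3
from flask import Flask, jsonify, request

app = Flask(__name__)
DB_PATH = "prototype.db"  # hypothetical relational database file


@app.route("/api/experiment/<int:experiment_id>", methods=["GET"])
def get_experiment(experiment_id: int):
    """Return one experiment record (metadata and data points) as JSON."""
    with sqlite3.connect(DB_PATH) as conn:
        conn.row_factory = sqlite3.Row
        row = conn.execute(
            "SELECT * FROM experiment WHERE id = ?", (experiment_id,)
        ).fetchone()
    if row is None:
        return jsonify({"error": "experiment not found"}), 404
    return jsonify(dict(row))


@app.route("/api/experiment", methods=["POST"])
def insert_experiment():
    """Insert a new experiment; the payload follows the expert-defined schema."""
    payload = request.get_json()
    with sqlite3.connect(DB_PATH) as conn:
        cur = conn.execute(
            "INSERT INTO experiment (author, year, subject) VALUES (?, ?, ?)",
            (payload.get("author"), payload.get("year"), payload.get("subject")),
        )
        conn.commit()
    return jsonify({"id": cur.lastrowid}), 201


if __name__ == "__main__":
    app.run(port=8080)
```

Separating front end and back end in this way means that any client able to issue HTTP requests can combine the services as preferred.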
Experimental Data. Experiments are actual experimental measurements investigating a particular environmental condition. An experiment is correlated with other metadata that characterize, for example, the experiment author, the methodology, and the experimental conditions. These metadata contain a series of information essential to classify the experiments correctly. In fact, while in this area there is a progressive propensity for sharing and a greater availability of data, it remains a sector in which the order of magnitude of existing data is much lower than in other areas such as, for instance, social media; therefore, it is essential to collect and correctly catalog the experiments in order to encourage their reuse. The domain experts defined which metadata are mandatory and which ones are optional, and the relational database schema was designed accordingly. An abstract representation of the experimental data metamodel is provided in Figure 2. The analysis tools will leverage these metadata to understand the predictive behavior of a model under specific conditions.

Figure 2: Experimental data metamodel. An experiment is linked to metadata such as author, journal, paper, year, equipment, subject, experimental conditions, and others.
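To make the metamodel of Figure 2 concrete, a record of this kind could be represented as below. Which fields are mandatory is decided by the domain experts; the split used here is an assumption made only for the example, as are the field names.

```python
# Illustrative rendering of the experiment metamodel of Fig. 2; field names and
# the mandatory/optional split are assumptions, not the expert-defined schema.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Condition:
    property_name: str           # e.g. "pressure" or "temperature"
    value: float
    unit: Optional[str] = None   # omitted for adimensional properties


@dataclass
class Experiment:
    # assumed mandatory metadata
    author: str
    year: int
    equipment: str
    subject: str
    conditions: List[Condition]
    # assumed optional metadata
    journal: Optional[str] = None
    paper_doi: Optional[str] = None
    others: dict = field(default_factory=dict)


example = Experiment(
    author="J. Doe", year=2021, equipment="shock tube", subject="ignition delay",
    conditions=[Condition("pressure", 1.2e5, "Pa"), Condition("temperature", 1100.0, "K")],
    paper_doi="10.1000/exp-0001",  # placeholder DOI
)
```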
In our scenario, the primary source of experiments is journal papers. Inside a paper there are usually multiple plots (an example is in Figure 3) or data tables about the measurements, where the corresponding metadata are not uniquely tabulated but are described narratively in the text. Recently, the tendency is to share the numerical data of the experiments in the supplementary material associated with papers, facilitating data collection. In some cases, a representation of the experimental data and metadata is already available in a format commonly used in the domain, such as the XML ReSpeCTh format adopted in [12], and it is available with an associated DOI. Metadata for the published papers are extracted from Scopus by retrieving citation data through the search APIs (https://dev.elsevier.com/).

Figure 3: Example of experimental and corresponding simulated data. Simulated data are theoretically a continuous set of points that the predictive model can compute; in practice, simulating a data point can be very expensive.

Model. Predictive models are treated as black boxes that, if provided to a numerical solver, can predict a particular domain setting. Thanks to the increasing availability of data and computational resources, the number of developed models has increased in the last few years. Nevertheless, not all models can predict all the environmental conditions of a domain. Therefore, the metadata associated with (but not only with) a predictive model are fundamental to study its behavior. The reasons behind the different capabilities of the predictive models vary but are mainly due to computational expensiveness: what is known as a "detailed model" is a complete model that can predict the behavior of a domain in many different conditions, but it takes a long time to execute since it has to solve many differential equations. For this reason, simplified models are derived from the detailed ones, at the cost of shrinking the prediction accuracy and reducing the capability to operate in and predict all the possible conditions of the domain. As in the case of the experiments, the domain experts define the mandatory metadata for a model (Figure 4).

Figure 4: Metamodel of the 'model' data. A model is linked to metadata such as author, version, conditions, and others.

Simulated Data. Simulations connect experiments to models. Given a model and a numerical solver, it is possible to simulate an experiment by specifying the experimental condition to the solver, thus generating the corresponding simulated data. These generated data are fundamental to performing different types of analysis on the experiments and on the model. For example, model validation is one of the most critical phases in the model development process. In this procedure, the model performance is evaluated by comparing the similarity of the experimental data with the corresponding simulated data, as in Figure 3, generating one possible type of analysis data.

3. Framework

Until now, the prototype was a proof of concept of what could be achieved, and once it was delivered, new requirements and discussions arose from the final users. In addition, with the switch to the framework version, new challenges related to day-by-day use needed to be properly addressed.

First, the framework should correctly manage and automate the entire life cycle of the data, from their insertion to their exchange, with all the associated implications such as data errors and different representation formats. Second, it is critical to integrate analysis tools to extract knowledge from the data. As before, the design and implementation of the new features have to be carried out keeping in mind that the final goal is to create a data ecosystem for experiments and predictive models. A detailed description of the framework is provided in [15]. This section focuses on the aspects most relevant to the development of a data ecosystem, the final goal of the project, that emerged during this phase.

3.1. Data integration and exchange

In some domains, experimental and simulated data can be expensive to generate or replicate. As a result, the data are accumulated over decades (in our domain, some of them date back to the late 1940s), witnessing an evolution of the representation formats over the years. Even in recent years, with the digitalization of the data, commonly agreed representation formats can be challenging to develop, since it is rare to witness perfect agreement within the scientific community about what is mandatory to represent.

Interoperability is a fundamental prerequisite for a data ecosystem, and for this reason, the strategy that reconciles the use of many representation formats, thus collecting as much data as possible, is to employ translation engines. Since the possible formats are few in our case study and there is no prevalent representation format, this strategy was the best trade-off. All data inside the framework are stored only in the relational database, following the schema defined by the experts, without being bound to particular formats. In order to feed and collect data from the framework, we need a translation engine for every required representation format. Similarly, each numerical solver accepts a configuration file for each simulation and produces an output file in a specific format. Also in this case, the use of translation engines allows the framework to be independent of the representation format of the data.
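A minimal sketch of the translation-engine idea is given below, assuming a registry with one translator per external format, all producing records in the internal expert-defined schema; the format names and XML tags are simplified assumptions, not the formats actually supported by the framework.

```python
# Sketch of the translation-engine strategy: one translator per external
# representation format, all normalizing to internal records (plain dicts
# here). Format names and tag names are simplified assumptions.
import csv
import io
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

TRANSLATORS: Dict[str, Callable[[str], List[dict]]] = {}


def translator(fmt: str):
    """Register a function that converts an external format into internal records."""
    def wrap(func):
        TRANSLATORS[fmt] = func
        return func
    return wrap


@translator("csv")
def from_csv(text: str) -> List[dict]:
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


@translator("xml")  # hypothetical, simplified XML layout with <dataPoint> elements
def from_xml(text: str) -> List[dict]:
    root = ET.fromstring(text)
    return [point.attrib for point in root.iter("dataPoint")]


def ingest(fmt: str, text: str) -> List[dict]:
    """Feed the framework with data expressed in any registered representation format."""
    if fmt not in TRANSLATORS:
        raise ValueError(f"no translation engine registered for format '{fmt}'")
    return TRANSLATORS[fmt](text)
```

The same mechanism can work in the opposite direction, for producing solver configuration files and for exporting data in external formats.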
3.2. Data management

Our data ecosystem has been designed to gather models, experiments, and simulations in the same system. Thanks to this structure, as shown in Figure 5, the framework acts as a man-in-the-middle that manages and shares the knowledge between the four entities to generate new knowledge.

Figure 5: The data ecosystem, with its analysis tools, acts as a man-in-the-middle between the four types of data (experiments, models, simulations, and analyses).

The downside of this conceptual architecture is that the entities are strongly connected, and incorrect data can quickly impact the others. For this reason, the concept of data ownership is introduced inside the framework to contain this hazard. In this way, once an error is identified, it is possible to quickly identify all the erroneous data involved. Services are provided in the framework for the analysis of data quality and for comparing the results of simulations with experimental data, as described in the following sections. In addition, data management operations on experimental data are provided to improve the quality of the experimental basis stored in the repository. The concept of ownership will be particularly helpful in the design of the roles in a data ecosystem, as described in Section 4, and therefore in regulating access to data.

3.3. Data quality

Nowadays, predictive models are increasingly data-driven, even in domains where a description of the phenomena with physical laws is available. For this reason, data quality plays a more and more central role in the model development process, since it directly impacts the prediction quality. In addition, ensuring certain data quality levels within the DE enhances the system trustworthiness, thus starting a loop in which the number of users increases as a consequence of the increased amount of collected data, and vice versa.

In our domain, following the concept of fitness for use [16], three quality dimensions have been identified: completeness, consistency, and accuracy. Timeliness is not of interest in the context of experiments and simulations, even if it is often used as a quality metric, mainly for two reasons. First, even if older experiments were carried out with older and less precise instruments, they still represent a valuable source of information, and their imprecision should be included in their uncertainty evaluation, which "just" needs to be handled correctly. Second, since experiments are expensive and hence rare, it is quite unlikely that multiple experiments are carried out in exactly the same conditions, thus "updating" the old values. For a similar reason, since the predictive models are deterministic, the simulated data do not change over time if computed with the same model and numerical configuration of the solver.

In the framework, the data quality control process is composed of two parts, one automatic and the other manual, where the automatic control is performed right after the insertion of a new data item in the repository and not, for example, a posteriori based on a recurrent schedule. Data that do not reach the minimum data quality requirements are immediately rejected.

As in all data quality applications, the rules to measure the data quality dimensions depend on the domain, and they are often also implementable as automatic checks. Regarding completeness, thanks to the domain knowledge provided by the experts, it is possible to know which metadata are mandatory or optional and under which conditions. For example, it is usually compulsory to express the unit of measurement and the name of the measured property, but for some properties' values the unit is not expressed since they are adimensional. Consistency works in a similar way: it is checked that properties of the same instance are consistent with each other. A typical example is the accordance between a property name, like "pressure," and a plausible unit of measurement such as "bar" or "pascal." Finally, the accuracy of the data is considered. It is well known that estimating accuracy is by far the most challenging data quality dimension, but a framework where experiments and models are combined has a non-negligible advantage here.
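The automatic part of these checks can be sketched as follows; the mandatory fields and the property/unit table are illustrative assumptions, not the actual expert-defined rules.

```python
# Sketch of the automatic quality control run right after insertion:
# completeness (mandatory metadata present) and consistency (property name
# compatible with its unit). The rule tables are illustrative assumptions.
from typing import List, Optional

MANDATORY_METADATA = {"author", "year", "equipment", "subject"}
PLAUSIBLE_UNITS = {
    "pressure": {"Pa", "bar", "atm"},
    "temperature": {"K"},
    "mole fraction": {None},      # adimensional, no unit expected
}


def check_completeness(metadata: dict) -> List[str]:
    return [f"missing mandatory field '{k}'" for k in MANDATORY_METADATA if not metadata.get(k)]


def check_consistency(conditions: List[dict]) -> List[str]:
    errors = []
    for c in conditions:
        allowed: Optional[set] = PLAUSIBLE_UNITS.get(c["property_name"])
        if allowed is not None and c.get("unit") not in allowed:
            errors.append(f"unit '{c.get('unit')}' is not plausible for '{c['property_name']}'")
    return errors


def passes_quality_gate(metadata: dict, conditions: List[dict]) -> bool:
    """Reject the record immediately if the minimum quality requirements are not met."""
    return not (check_completeness(metadata) or check_consistency(conditions))
```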
Figure 5 also reflects the typical relation between the experiments and the model: during the model validation procedure, the experiments are used to quantify the predictive model performance. However, since the model is obviously not perfect, it has an (epistemic) uncertainty, but it is reliable enough in many different conditions that it can be used to check whether most of the information inside an experiment is meaningful. In fact, both the accuracy of the numerical data and that of the metadata can be tested. If the reported measurements differ significantly from the simulated data, this discrepancy suggests an error in the measurements or in the metadata used to set up the simulation. In other words, the model can validate the experiments. This approach foresees the cross-validation of an experiment against multiple simulated data sets for the same experimental condition but obtained with different models. In such a way, it is possible to create a de facto ground truth against which to assess whether an experiment is plausible or wrong. However, this is not always conclusive: if the experimental data are very different from the simulated data, this is only a hint of a possible error, not a certainty. Therefore, this automatic approach is combined with a manual validation of the experimental data by an expert.

3.4. Data FAIRness

FAIR (Findable, Accessible, Interoperable, and Reusable) data [17] have been shown to bring many benefits to data ecosystems. In this section, following the recommendations from the literature [18], we present, for each FAIR principle, the functionalities implemented or designed for the experimental data inside our data ecosystem.

Findable. Experiments are stored and used inside the data ecosystem through a relational database, which is very flexible and easily maintainable compared to a file-based organization of the experimental data. Nevertheless, a database representation of the experiments is not findable; for this reason, for each experiment we create an XML representation following an XML schema widely accepted in the scientific community of the experiments' domain. The file is then automatically uploaded to Zenodo to assign it a unique global identifier, together with other metadata that make the experiment searchable without necessarily using our data ecosystem.

Accessible. Experiments inside our data ecosystem are identified both with a (numerical) primary key and with the associated DOI. A numerical primary key makes implementing the relational instances in the database easier, even before the DOI has been generated. Our data ecosystem offers data management services through an HTTP API, accepting typical request formats such as CSV, JSON, and XML. One of the advantages of such an HTTP API micro-service structure is that the final users are not required to use particular software or programming languages, or to have technical expertise, to access data and services, and they can combine them as preferred. Authentication is required to use the API, upon a free sign-in request procedure. Authentication enables traceability and accountability of the operations and helps maintain the quality level of the scientific repository compared to an open-access configuration.
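As an example of this kind of access, a client could retrieve an experiment as sketched below; the base URL, authentication scheme, and response fields are placeholders assumed for the example, not the documented interface.

```python
# Sketch of a client accessing an experiment through the HTTP API after
# authentication. Endpoint, token handling, and response fields are assumed.
import requests

BASE_URL = "https://example-data-ecosystem.org/api"   # placeholder host
TOKEN = "token-obtained-after-free-sign-in"           # placeholder credential


def get_experiment(experiment_id: int, fmt: str = "json") -> dict:
    """Retrieve one experiment by its numerical primary key (lookup by DOI is analogous)."""
    response = requests.get(
        f"{BASE_URL}/experiment/{experiment_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"format": fmt},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    experiment = get_experiment(42)
    print(experiment.get("doi"), experiment.get("conditions"))
```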
Interoperable. Experiments in their XML representation format are a plug-and-play solution. Every researcher can use them as preferred, paying attention to the definition of each XML tag. If the experiments are accessed through the HTTP API, the same vocabulary of the XML representation format is used to query the database and in the responses.

Reusable. One of the primary purposes of the data ecosystem is to reuse data, encourage their sharing among institutions, and avoid duplicates. Experimental data can be uniquely cataloged through some of their metadata. Developing the database around the uniqueness constraint of these metadata allows us to maximize reuse.

3.5. Data generation and analysis

Thanks to the model, we can theoretically generate an infinite amount of simulated data and, similarly, using the analysis tools and combining them as we prefer, we can create a vast number of analysis data. Neglecting the space needed to store such quantities of data, the first limitation that makes this idea unfeasible is the amount of computational resources needed to generate them. A centralized architecture where all the computational burden is on a single organization is not sustainable. Even if the cost were shared, the bureaucracy behind sharing computational resources is very complicated. The solution to this problem is a coordinator-worker architecture where the framework, i.e., the coordinator, collects the jobs and distributes them among the workers, which in some cases can delegate the job to other machines, as shown in Figure 6. The coordinator-worker configuration is scalable and allows each user to decide how many computational resources to dedicate and to use them only for their own jobs.

Figure 6: Coordinator-worker architecture. The data ecosystem (coordinator), the workers, and their computational resources interact in six steps: 1. create job; 2. ask for a job; 3. get job-related data; 4. execute job; 5. collect job results; 6. forward job results.
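The worker side of Figure 6 can be sketched as a simple polling loop; the endpoints and the run_solver helper are hypothetical names introduced only for the example.

```python
# Sketch of the worker side of the coordinator-worker architecture of Fig. 6.
# Endpoints and the solver invocation are hypothetical placeholders.
import time
import requests

COORDINATOR = "https://example-data-ecosystem.org/api"  # placeholder coordinator URL


def run_solver(model: dict, condition: dict) -> dict:
    """Placeholder for invoking the numerical solver available on this worker."""
    raise NotImplementedError


def worker_loop(poll_seconds: int = 60) -> None:
    while True:
        job = requests.get(f"{COORDINATOR}/job/next", timeout=30).json()                # 2. ask for a job
        if not job:
            time.sleep(poll_seconds)                                                    # no pending jobs
            continue
        data = requests.get(f"{COORDINATOR}/job/{job['id']}/data", timeout=30).json()   # 3. get job-related data
        results = run_solver(data["model"], data["condition"])                          # 4. execute job
        requests.post(f"{COORDINATOR}/job/{job['id']}/results", json=results, timeout=30)  # 5-6. return results
```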
Providing analysis tools inside the framework is a game-changer. The user is incentivized to stay in the system and leverage the other knowledge available in terms of data and tools. The more the users stay in the system, the more they are inclined to share data, thus improving the overall system and starting a virtuous circle. Such tools generate new knowledge about the data or the domain and increase the awareness of and insight into the data. The concept of database coverage, or diversity, is central here, for example. In all data-driven models, "you are what you eat," and therefore if a model is generated only using data that represent a restricted portion of a domain, it will be able to predict more or less correctly only what it has already seen. The drawbacks of such an approach can lead to ethical problems, since classification and regression models can have strong biases based on the diversity and balance of the data used to generate them. A predictive model for physical domains suffers from the same hazard, although the data are mainly used in the validation phase: if the model is validated against a large but not diverse amount of data, the predictive model performance can look astonishing, while in practice it can be much worse.

4. Data ecosystem

The final stage of this evolution regards the transition from a framework to a data ecosystem. In this last evolution stage, what is important to investigate is how the framework, which has been actively used by one research group, should evolve to host multiple organizations and many more users. This transition, which seems straightforward, in practice poses mainly two challenges, which can be addressed smoothly thanks to the design choices of the previous project steps. First, the repository management activities described before, such as experiment validation, need to be formalized in terms of responsibility and accountability. Second, the data ecosystem may host data covered by intellectual property (IP) that are not yet open access but are in the data ecosystem because the final user wants to take advantage of its functionalities and analysis tools, for example to compare the quality of data. Both these challenges have in common that it is necessary to define user roles and rules with corresponding permissions over the data ecosystem functionalities.

In this scenario, it is assumed that the data ecosystem is trustworthy in terms of privacy and security and that no specific entity owns it: it belongs to the community.

Figure 7: The main five roles inside our DE and their main four activities.

4.1. Roles

Several organizations collaborate within a data ecosystem. An organization is an abstract concept that groups several people. Sometimes it is possible to map this concept to other familiar entities such as a university, a research center, a department, or a research group. Each user belongs to at least one organization to be part of our data ecosystem and has at least one role. The (virtual) ownership of the data belongs to the organizations. Data entered or generated by a user are owned by the organization to which the user belongs, while the authorship of the data remains with him or her. The users must specify whether the data deriving from them are open or closed content. Each user can access all the open-content data of all organizations inside the data ecosystem and all the closed-content data belonging to their organization(s).

The organization-based configuration allows closed-content resources to be shared easily among organizations, with different levels of granularity and relationship: a single experiment or a group of experiments can be shared with another organization, or an organization can share its whole closed-content data in one or both directions.

The data ecosystem itself holds the role of publisher: as soon as a content item is made open, the DE generates, in the case of experiments, an XML representation file that is published on Zenodo (https://zenodo.org/) to associate a DOI with it and enhance accessibility and findability.

Besides the publisher role, in our scenario five user roles were identified, as follows. Figure 7 shows the five roles involved in four typical actions of the overall workflow for the model development process. The actions represented are the generation of experiments or models and their insertion into the DE, the collection of data (such as analyses, experiments, simulations, and models), and the creation of simulation and analysis jobs.
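The access rule described above (open content visible to every authenticated user, closed content only within the owning organizations) can be condensed into a check of the following kind; the data structures and names are assumptions made for the example.

```python
# Sketch of the open/closed content access rule tied to organizations.
# The User and Item structures are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Item:                       # an experiment, model, simulation, or analysis
    owner_org: str
    open_content: bool


@dataclass
class User:
    organizations: Set[str] = field(default_factory=set)


def can_read(user: User, item: Item) -> bool:
    """Open content is readable by everyone; closed content only inside the owning organization."""
    return item.open_content or item.owner_org in user.organizations


reader = User(organizations={"org-a"})
assert can_read(reader, Item(owner_org="org-a", open_content=False))
assert can_read(reader, Item(owner_org="org-b", open_content=True))
assert not can_read(reader, Item(owner_org="org-b", open_content=False))
```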
Experimentalist. This role identifies a scientist who carries out the experiment and generates the experimental data. The experimentalist holds the intellectual property of the data. Depending on the situation, the experimentalist can decide to immediately publish the results in a journal (or similar venue) or to provide the data directly to other entities through private communication and publish them later. According to this choice, the experiments have an open or a closed content policy, respectively. Even if a journal is not open access or requires a subscription, its experiments are considered open content because they are publicly available material.

Researcher. The researcher has mainly two functions in our DE and scenario. First, the researcher generates the predictive model and, as in the case of the experimentalist, is free to choose its publication policy. Second, the researcher has the duty to verify the experiments in their validation procedure, as described before. If the experiment to be validated is open content, a cross-validation strategy is preferred: a researcher from an organization different from the one owning the experiment performs the task, to avoid possible bias and enhance the DE's overall trustworthiness. It is assumed that there is at least one researcher per organization.

Reader. The reader represents the user who has permission to access the open contents and all the closed contents belonging to their organization. Thanks to authentication, it is transparently possible to hide part of the experiments, models, simulations, and analyses without changing the API.

Writer. The writer is a trained user whose task is to insert into the DE all the collected data. A trained user is needed because, in this field, insertion is not a straightforward operation and requires basic domain knowledge, even if the system and the researcher will check the validity of the data later. The writers mainly insert experiments and models. They can find these data in the literature, or the data can be provided through private communication. In any case, it is their responsibility to associate the correct content policy with the objects.

Executor. This role represents a user who has the privilege to allocate resources and generate new data in terms of simulations and analyses. In both cases, the executor needs access to both experiments and models to create a new simulation or perform analyses (for example, when experiments need to be compared against simulations). These operations can be expensive. Also in this case, domain experience is required, for example, to set the optimal numerical configuration to solve a simulation and thus use the computational and storage resources wisely.

It is worth mentioning that even if an experiment is closed content and the user does not have the permissions, its metadata, i.e., in this domain, the experimental condition, are in any case open, and therefore it is possible to simulate this configuration. Nevertheless, all the analysis operations that compare the simulated data against the experimental data will be hidden.

5. Discussion and Conclusion

In this paper, we presented our experience in developing a data ecosystem to improve the development process of chemical-physical predictive models. As often happens in practice, our design process for the data ecosystem followed a bottom-up approach rather than a top-down one, due to the necessity of delivering a usable product quickly. The development of the final system foresees three product-related phases: prototype, framework, and data ecosystem. In each step, some properties of the final data ecosystem are taken care of. This approach allowed us to add complexity to the final ecosystem's design incrementally and to deal more smoothly with new requirements arising from a non-typical application domain.

In addition to the typical challenges, a chemical engineering data ecosystem has to deal with specific types of data, such as predictive models and experimental and simulated data, that require ad hoc methodologies, for example, in the case of data quality measurements or intellectual property management. Some of these aspects are distinctive of scientific repositories, while the three-phase approach and some challenges and solutions are more universal. The prototype phase, in particular, is important to collect the requirements arising from a new and complex domain, with the final goal of discovering the main types of data that need to be stored and the necessary services. The result of this step is the database and system architecture. A micro-service structure is a convenient architecture since, during a bottom-up approach, it is very probable that new requirements will arise, and implementing a new service is often a combination of the existing ones. The framework step addresses the challenges of transforming a proof of concept into a system used daily by a restricted number of users. Therefore, this system version accounts for data quality and data management aspects, implements the FAIR principles, and has to be scalable in terms of computational resources. The final evolution deals with distinguishing the user roles inside the DE and the data ownership. In such a way, higher trustworthiness and transparency of the system and of the data are guaranteed, while the intellectual property requests are fulfilled.

Future developments concern improving the implementation of some FAIR principles, in particular findability and reusability. We plan to introduce new features to make experiments with a restricted access policy, due to their intellectual property, searchable from the outside. Exposing just the metadata could enhance both the findability and the reusability of the experiments. In addition, we plan to present a provenance data model to improve the reusability of the analyses and the models, following the W3C recommendations.

References

[1] M. Jarke, B. Otto, S. Ram, Data sovereignty and data space ecosystems, Business & Information Systems Engineering 61 (2019) 549–550.
[2] E. Curry, A. Sheth, Next-generation smart environments: From system of systems to data ecosystems, IEEE Intelligent Systems 33 (2018) 69–76.
[3] V. Stodden, The data science life cycle: a disciplined approach to advancing data science as a science, Communications of the ACM 63 (2020) 58–66.
[4] C. Tenopir, E. D. Dalton, S. Allard, M. Frame, I. Pjesivac, B. Birch, D. Pollock, K. Dorsett, Changes in data sharing and data reuse practices and perceptions among scientists worldwide, PLoS ONE 10 (2015) e0134826.
[5] C. Cappiello, A. Gal, M. Jarke, J. Rehof, Data Ecosystems: Sovereign Data Exchange among Organizations (Dagstuhl Seminar 19391), Dagstuhl Reports 9 (2020) 66–134. URL: https://drops.dagstuhl.de/opus/volltexte/2020/11845. doi:10.4230/DagRep.9.9.66.
[6] F. Sidi, P. H. S. Panahy, L. S. Affendey, M. A. Jabar, H. Ibrahim, A. Mustapha, Data quality: A survey of data quality dimensions, in: 2012 International Conference on Information Retrieval & Knowledge Management, IEEE, 2012.
[7] E. Ramalli, B. Pernici, Know your experiments: interpreting categories of experimental data and their coverage, in: SeaData at VLDB 2021, CEUR Workshop Proceedings, 2021, pp. 27–33.
[8] I. N. Chengalur-Smith, D. P. Ballou, H. L. Pazer, The impact of data quality information on decision making: an exploratory analysis, IEEE Transactions on Knowledge and Data Engineering 11 (1999) 853–864.
[9] M. Drosou, H. Jagadish, E. Pitoura, J. Stoyanovich, Diversity in big data: A review, Big Data 5 (2017) 73–84.
[10] S. Geisler, M.-E. Vidal, C. Cappiello, B. F. Lóscio, A. Gal, M. Jarke, M. Lenzerini, P. Missier, B. Otto, E. Paja, B. Pernici, J. Rehof, Knowledge-driven data ecosystems: Towards data transparency, ACM Journal of Data and Information Quality 14 (2022) 1–12.
[11] M. Frenklach, Process informatics for combustion chemistry, in: 31st International Symposium on Combustion, Heidelberg, 2006.
[12] T. Varga, T. Turányi, E. Czinki, T. Furtenbacher, A. Császár, ReSpecTh: a joint reaction kinetics, spectroscopy, and thermochemistry information system, in: Proceedings of the 7th European Combustion Meeting, volume 30, Citeseer, 2015, pp. 1–5.
[13] C. Batini, M. Scannapieco, Data and Information Quality: Dimensions, Principles and Techniques, Springer, 2016.
[14] G. Scalia, M. Pelucchi, A. Stagni, A. Cuoci, T. Faravelli, B. Pernici, Towards a scientific data framework to support scientific model development, Data Science 2 (2019) 245–273.
[15] E. Ramalli, G. Scalia, B. Pernici, A. Stagni, A. Cuoci, T. Faravelli, Data ecosystems for scientific experiments: managing combustion experiments and simulation analyses in chemical engineering, Frontiers in Big Data (2021) 67.
[16] R. Y. Wang, D. M. Strong, Beyond accuracy: What data quality means to data consumers, Journal of Management Information Systems 12 (1996) 5–33.
[17] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, et al., The FAIR guiding principles for scientific data management and stewardship, Scientific Data 3 (2016) 1–9.
[18] H. Koers, D. Bangert, E. Hermans, R. van Horik, M. de Jong, M. Mokrane, Recommendations for services in a FAIR data ecosystem, Patterns 1 (2020) 100058.