Know your experiments: interpreting categories of experimental data and their coverage

Edoardo Ramalli, Politecnico di Milano, Milan, Italy — edoardo.ramalli@polimi.it
Barbara Pernici, Politecnico di Milano, Milan, Italy — barbara.pernici@polimi.it

ABSTRACT
Data management in scientific domains is more important than ever due to the increasing availability of experimental data. Automatically integrating and managing the information would significantly speed up its reuse and, in particular, the development of predictive models for a given domain. However, the diversity, ambiguity, and complexity of experimental data make this hard in practice. In this work, we propose a general approach to overcome these challenges, combining a human-in-the-loop process with a new methodology to automatically understand the semantics of experimental data, which can also be used as a data cleaning procedure. In addition, we focus on assessing the domain coverage of an experimental database using only categorical characteristics of the domain, which is essential for model validation and for understanding if and where there is a need to perform additional experiments.

Reference Format: Edoardo Ramalli and Barbara Pernici. Know your experiments: interpreting categories of experimental data and their coverage. In the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data 2021).

[Figure 1 (tables): experiment metadata (ID 12, Reactor PFR, Exp. Type O.C.M, T 300 K, P 1 atm, Phi 0.5, ...); experimental data series (Temperature 800, 827, 855, 883, ...; Concentration 2E-04, 2E-04, 3E-04, 3E-03, ...; Pressure 1.0, 1.1, 1.3, 1.2, ...); simulated data series (Temperature 800, 827, 855, 883, ...; Concentration 0, 0, 0, 4E-04, ...; Rho 0.9, 0.9, 0.8, 0.6, ...).]

1 INTRODUCTION
The collection of experimental data in many disciplines has produced a massive amount of data over the decades.
However, the quality of the data and the collection methodologies have changed with the evolution of the research fields and the improvements in the technology used to carry out measurements. Over the years, this progression has led to the availability of considerable amounts of data, which are, however, likely affected by ambiguity problems due to their heterogeneity and complexity.

[Figure 1 (plot): Concentration [Mole Fraction] versus Temperature [K] for the experimental and the simulated data.]

Figure 1: In the plot, a simplified example of the experimental data of interest and the corresponding simulated data of an experiment. In the tables, the tabular data and metadata of an experiment with the simulated data.

Copyright © 2021 for the individual papers by the papers' authors. Copyright © 2021 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in the Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021, Copenhagen, Denmark) on CEUR-WS.org.

At the same time, the increasing availability of experimental data has spurred the development of predictive models to study a domain and improve the related technologies. These data-driven models are hungry for data and, as a consequence, there is the need to automatically collect, store, and manage large quantities of information coming from different sources, in different representation formats, and at different quality levels. Data ecosystems address these problems by integrating disparate or incompatible data sources while maintaining a specific quality level [8]. As experimental data are a precious source of value, the FAIR principles encourage the reuse and sharing of data [27]. Nevertheless, their complexity often makes it hard to put these ideas into practice. In fact, in many domains the data management system has such an essential role that "the available data management resources define what is discovered" [1], removing the differences and the ambiguity and making the data usable.

In our case study, combustion kinetics, shown in a simplified snapshot in Figure 1, we work with the experimental data of an experiment, its associated metadata, and the corresponding simulated data of a predictive model, with the goal of comparing them to validate and improve the model itself. The challenge is to overcome the manual management of the data: we need to automatically interpret the experiment, i.e., distinguish the actual subject data of the experiment among all the data columns, simulate it with the right solver, and correctly pair the experimental data with the simulated data based on the content of the experiment metadata.

Combustion kinetics has been the subject of study for many decades. For this reason, many experimental data regarding different fuels in several environmental conditions have been collected over the years. The evolution of the combustion study process has led to increasingly precise measurements, enriching the experimental data with outline details that are decisive for a complete understanding of the phenomenon. Over the last few decades, data collection has become more methodical and massive, which has allowed for the development of predictive numerical models. A numerical model can simulate complex domains without the necessity of carrying out experiments that are expensive in terms of time and price. In particular, in the case of combustion kinetics, we can predict the behavior of reactors and fuels in different conditions to improve their efficiency and reduce pollutants.

Even today, both the models and the experimental data are mainly affected by two problems that make them sources of heterogeneous information. The first problem regards uncertainty; the second concerns the ambiguity of the information contained therein. Uncertainty can be related to the experimental imprecision or to the error made by the model representing the domain. These two types of uncertainty are defined as the aleatoric one, which is related to the noise present in the data, and the epistemic one, associated with what the model does not represent precisely [6]. Similarly, ambiguities can be encountered both in the model and in the experimental data. An example regards chemical names: many different fields deal with chemical compounds whose names are not uniquely defined, and for this reason diverse nomenclatures of the same compound can be found both in different models and in experimental data. This obstacle prevents an immediate integration of experimental data from different sources and a direct comparison of different models [14]. Another characteristic of this domain is that it is hard to automatically identify the experiment subjects among the various information contained in the data.

We propose an iterative approach to understand and store experimental data with humans-in-the-loop, focusing on three aspects that are critical in building scientific models: interpreting the scientific data, assessing the coverage of the experiments in a specific domain, and cleaning and improving the scientific repository. To this purpose, we propose a rule-based interpretation of each experiment that enables automatically validating and cleaning the data using a similarity index. Furthermore, it is important to quantify the database coverage of the experimental domain space. The coverage of a domain by an experimental database affects the ability to assess the validity of numerical simulation models developed for that domain and, if the database is used as a training set for machine learning models, it can have a heavy impact on the quality of the resulting model [9]. We define a general index that quantifies the coverage and the experiments' density distribution by combining categorical attributes of the database schema with multidimensional matrices.

The paper is structured as follows. In Section 2, we discuss related work and open problems. A general approach to integrate different data sources with semantic heterogeneity problems is introduced in Section 3. Section 4.1 presents a rule-based approach to automatically interpret the semantics of experimental data, and a methodology to quantify the coverage of an experimental database is presented in Section 4.2. An automatic analysis of a numerical simulation model using experimental data, which facilitates data cleaning procedures, is discussed in Section 4.3. A final discussion is presented in Section 5.

2 RELATED WORK
In recent years, there has been growing attention to the sharing and reuse of data [24]. Several projects have been developed, such as EOSC (European Open Science Cloud), focused on reusing, integrating, and sharing data and services within the scientific community. An example is Clowder [16], a framework that facilitates the development of a data management system, offering features for visualization, annotation, and management. Although Clowder has shown that the framework can be used in different domains, each domain has its own characteristics and requires specific implementations that are difficult to generalize. Homer is an example of a system for managing experimental biological data [1]. The heterogeneity of the collected data, represented and managed over time, defines what can be discovered and directly affects the quality and quantity of research results. For this reason, there is a need to integrate and manage complex and heterogeneous scientific data in a system capable of extracting value from them. The integration of experimental data from different sources is not an easy task: correct use of metadata can provide the necessary knowledge for the preservation, access, and reuse of scientific data [10] and therefore supports both the immediate development of applications and the long-term maintainability and accessibility of data. In this context, there is the need for a data management system that offers services for integrating heterogeneous sources of information for the case study of combustion kinetics, removing the semantic ambiguity of the data, and provides services to analyze the data and improve the predictive model. In particular, such a system has to analyze together multiple experimental data that compose a trend rather than stand-alone pieces of information.

Regarding the uncertainty in the data and in the model, techniques have been developed to separate the two types of uncertainty [21], but it is not easy to estimate them if a ground truth is not available. In combustion kinetics, it has been conventionally chosen to assume arbitrary uncertainty values, if missing, for specific types of experiments and apparatuses [17].

Different formats have been proposed to represent combustion kinetics experimental data and remove their ambiguity. These formats contain mandatory or optional fields that limit the freedom of each researcher in defining their own fields, thus moving towards a standardization of the representation. There are mainly two representation formats in combustion kinetics, ReSpecTh [25] and ChemKED [26].

The diversity of a dataset is a critical aspect in many practical applications, but it is often overlooked [7]. As a result, biased predictors can easily be obtained, which can also have severe repercussions in everyday life [2]. The coverage of a database allows us to understand how diverse a dataset is. Recent proposals allow quantifying the coverage of a database using recognition patterns over categorical attributes [2], which can also be found on different tables [15]. These approaches are based on the definition of patterns and thresholds; there is, therefore, a need to accurately and precisely define both.

Data is critical for the development of machine learning-based models. For this reason, data management has an increasingly central role in these activities, as the results of the models are strictly dependent on the dataset [18]. In more recent times, the focus has shifted towards the correct integration and quality of the data, and for this reason the reverse operation is carried out: the models are used iteratively to evaluate and improve the quality of the data [20]. This data cleaning procedure can improve the starting dataset, paying attention to maintaining the convergence of the machine learning model [13]. Other techniques of data cleaning rely on the definition of rules: based on the result of the evaluation of a condition, a specific operation is performed [5].

3 SCIEXPEM
In many experimental disciplines, data is collected from different sources such as repositories, the literature, or private communication between research laboratories. This entails having to manage various problems related to the heterogeneity of the data [11]. Furthermore, as in combustion kinetics, there is no uniquely accepted representation standard to convey this information. All this implies, even for the most recent data, different accuracy, completeness, and other data quality dimensions of the repository [23].

Experimental data are precious both for their rarity and for the cost of collecting them. For this reason, it is essential to accept all the experiments and then carry out a series of automatic checks to preserve the repository's quality. For example, a possible control is on the consistency between the unit of measurement and the measured property. Another quality dimension to guarantee is completeness: since the data come from different sources, times, and formats, it is essential to ensure that all the primary information of an experiment, in terms of metadata, is complete. Regarding the semantic accuracy of the experimental data, it is important that the values of the properties are within a range of reasonableness. However, while in the literature there has been extensive attention to developing techniques for managing and ensuring data quality and consistency (see [3] for an extensive survey), there are still many open problems in understanding the quality of data in their context of use. In particular, in this paper we focus on using experimental data in simulation model development in general, in a context in which the experimental error can be notoriously significant but is not (or cannot) easily be quantified. In this context, the problem is the ability to identify possible errors in the data and/or in the models, in a joint validation effort based on a data-driven approach. Finally, a crucial aspect for all data-driven applications is automation: in the case of predictive model development, manually managing the simulations and validations of the experiments is a wasteful and error-prone task. The problem is to provide a generic framework that manages experiments easily and in a domain-independent way, associating them with the information needed for data-driven techniques, such as simulations and predictions.

To tackle these problems, we define the process illustrated in Figure 2, which follows the entire life cycle of experimental data to guarantee a certain level of data quality, according to different quality dimensions, and at the same time provides information to improve the predictive model. This human-in-the-loop process is implemented within SciExpeM (Scientific Experiments and Models), a framework that offers different services related to the management and analysis of experimental and simulated data to speed up the predictive model development process in combustion kinetics [19, 22]. We associate with the activities in the process additional metadata to assess the validation state of the experimental data, status, which denotes whether an experiment is new in the database or whether it is invalid or verified.

SciExpeM uses the process for different applications. First of all, the user enters the experimental data into the system using, for example, an interactive form. The experiment is added to the database, and SciExpeM checks for syntactic or detectable semantic errors. Initially, the new data are tagged as new, and they can be set to invalid in any of the following phases if flaws in the data are detected. In a second moment (activity Check experiment in Figure 2), an expert has to verify each new experiment, checking for undetectable semantic errors and filling in the incomplete experiment metadata. Once an experiment is verified, the status field changes accordingly, and SciExpeM couples the experiment to an interpreter. Experimental data and results of simulators are records of information that we need to distinguish and pair automatically. To this purpose, we propose to associate experimental data with the concept of an interpreter for the data. This entity, in particular, can recognize the properties that are under investigation in an experiment from the others that are just auxiliary information, such as environmental conditions. For example, in Figure 1, the pressure is neither the dependent nor the independent variable (or property) under investigation, unlike temperature and concentration. Moreover, based on the experiment details, the interpreter knows which solver needs to be used to simulate it and how to correctly pair the experimental data with the corresponding simulated ones. Finally, when the system can manage an experiment independently with its simulations, a loop starts. The simulated data are compared with the experimental data using a similarity index that provides information to improve both the model and the repository quality. This comparison is possible because we leverage a bidirectional relationship: we use the model to validate the data and the data to validate the model. First, using the experimental data to validate the model helps in understanding which aspects or portions of the domain represented by an experiment still need to be improved. Second, we use the model to understand whether the semantics of the experimental data is reliable: a model that differs strongly from the experimental data signals either an error in the model or incorrect experimental data. The human-in-the-loop approach allows assessing these discrepancies and taking the appropriate actions.

[Figure 2 steps: Add Experiment → Check Experiment → Verify Experiment → Interpret Experiment → Simulate Experiment → Analyze Simulation → Improve Model and/or Data Cleaning.]

Figure 2: A simplified schema of the experimental data process.
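The experiment status life cycle just described (every accepted experiment starts as new, an expert promotes it to verified, and invalid is reachable from any phase) can be sketched as a small state machine. The class and method names below are illustrative, not SciExpeM's actual API.

```python
from enum import Enum


class Status(Enum):
    NEW = "new"            # just inserted; only automatic checks passed
    VERIFIED = "verified"  # checked by a domain expert, metadata completed
    INVALID = "invalid"    # flaws detected; kept to prevent re-insertion


class Experiment:
    """Minimal sketch of the status metadata attached to each experiment."""

    def __init__(self, metadata):
        self.metadata = metadata
        self.status = Status.NEW  # every accepted experiment starts as 'new'

    def verify(self):
        """An expert confirmed metadata and semantic correctness."""
        if self.status is Status.NEW:
            self.status = Status.VERIFIED

    def invalidate(self):
        """Flaws can be detected in any phase of the process."""
        self.status = Status.INVALID
```

Keeping invalid entries in the repository, rather than deleting them, is what prevents the same flawed experiment from being re-entered later.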
4 EXPERIMENTS MANAGEMENT
Representing, collecting, and integrating heterogeneous data in a database are only the initial steps to extract value. In Section 4.1, we present our approach to interpreting the semantics of the data correctly; in Section 4.2, we measure the coverage of a database in a given domain; in Section 4.3, we focus on improving the repository quality.

4.1 Rule-based Automatic Interpretation
Experiments are records of measured properties and other metadata that characterize them. Besides, among the measurements, it is not rare to find additional measured properties that specify, for example, the environmental conditions of the measures without being the subject of the scientific observation. This peculiarity generates ambiguity, since a property could be the subject in one experiment but not in another. In practice, to manage scientific data, there is the need to distinguish automatically which, among the measured properties, are the dependent and the independent variables.
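The ambiguity can be made concrete with two toy records, mirroring Figure 1; the records and property names are invented for the example.

```python
# The same property plays different roles in different experiments:
# its role is determined per experiment, not globally.

experiment_a = {
    "independent": "temperature",   # property varied on the x-axis
    "dependent": "concentration",   # measured response on the y-axis
    "auxiliary": ["pressure"],      # recorded, but not under study
}

experiment_b = {
    "independent": "time",
    "dependent": "pressure",        # here pressure IS the subject
    "auxiliary": ["temperature"],   # and temperature is just a condition
}

# A per-experiment role table, rather than a global one, is therefore needed.
roles = {
    ("a", "pressure"): "auxiliary",
    ("b", "pressure"): "dependent",
}
```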
In this context, we need to teach the data management system to recognize the role of each property in each experiment, keeping in mind that what makes a property a subject of an experiment is a particular combination of metadata values of the experiment itself. For this reason, it is necessary to define a flexible methodology to distinguish the subject properties from the auxiliary ones. In other words, we need to find an approach to transfer the domain knowledge into SciExpeM so as to interpret the semantics of an experiment correctly and treat all the database entries with equal semantics in the same way.

Manual management of this complex database is not feasible, because an experiment could contain dozens of measured properties and, for example, we would have to tag each of them correctly as subject of the experiment or not. Moreover, this procedure would have to be repeated hundreds of times, once for each experiment, making it hard to analyze a large amount of data.
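The automatic tagging needed here can be mechanized as a rule-matching step: an interpreter is assigned to an experiment only if the experiment fulfills all of the interpreter's rules. The sketch below simplifies the rules to (attribute, value) pairs on a flat metadata dictionary; the field names and interpreter ids are illustrative, not SciExpeM's schema.

```python
# Sketch of rule-based interpreter assignment.

def fulfills(experiment: dict, rule: tuple) -> bool:
    """A rule (attribute, value) is fulfilled if the experiment has that
    attribute and its value matches the rule's value."""
    attribute, value = rule
    return experiment.get(attribute) == value


def assign_interpreter(experiment: dict, interpreters: dict):
    """Return the id of the first interpreter whose rules are ALL fulfilled,
    or None if no interpreter matches (the experiment then needs a human)."""
    for interpreter_id, rules in interpreters.items():
        if all(fulfills(experiment, rule) for rule in rules):
            return interpreter_id
    return None


interpreters = {
    50: [("Reactor", "PFR"), ("Exp. Type", "IDT")],
    60: [("Reactor", "RCM")],
}
exp = {"Reactor": "RCM", "Exp. Type": "IDT"}
print(assign_interpreter(exp, interpreters))  # → 60
```

Evaluating a handful of declarative rules per experiment replaces the manual tagging of dozens of properties across hundreds of experiments.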
Accordingly, we propose a methodology that automatically extracts useful information from a database model in which semantic heterogeneity is present. We propose a dynamic interpretation of a database model based on rules, similar to what is done for data cleaning or to ensure consistency and accuracy in a database [5]. Figure 3 shows the class schema of the database that we use to implement the automatic interpretation of scientific experiments. Given a model Experiment (Exp.), E, that is an abstract representation of a model affected by ambiguity, we have to assign, for each entry e ∈ E, an Interpreter entry of the model I. This model can save additional meta-information that could be useful for other tasks. For example, in this schema, the interpreter knows which precise solver we need to use to simulate an experiment. Each interpreter knows how to distinguish the primary data from the secondary information and how to correctly map them. This is possible because the interpreter has multiple references M = {m_1, ..., m_n} to a mapping model M that knows, for example, the correct dependent–independent variable relation or, more in general, can separate the useful information from the secondary one and, if necessary, pair them. In order to associate an interpreter with an entry of the model E, we associate a set of rules, R = {r_1, ..., r_k}, with the interpreter. These rules r are entries of another table in the database, rule, R, where each element specifies a model name N, an attribute's name A, and a value V. A rule r ∈ R is fulfilled by an entry e ∈ E if A is an attribute of e and the corresponding value of the attribute is equal to V. The model name N is an optional field that, if defined, specifies that the rule is not directly on an attribute of the model e but is related to an attribute of another model N that has a reference to the entry e. If an entry e fulfills all the rules r associated with an interpreter i ∈ I, we can associate the interpreter i with the entry e.

[Figure 3 schema: DATA (EXP ID, DATA NAME, VALUE), EXPERIMENT (EXP ID, INTER. ID), METADATA (NAME, VALUE, EXP ID), INTERPRETER (INTER. ID, SOLVER), RULE (MODEL, NAME, VALUE, INTER. ID), MAPPING (INTER. ID, TYPE, DATA TYPE, NAME).]

Figure 3: The class model used to represent the domain knowledge and correctly interpret the semantics of the experiments.

4.2 Database Coverage
The Model Validation procedure systematically measures how good the predictions of a model are compared to the corresponding experimental data. For the result of this procedure to be reliable, the experimental database should, if possible, cover the domain as much as possible and with equal granularity. Database coverage can help in this task, providing an immediate procedure to measure the diversity and completeness of representation of the database.

We leverage categorical attributes and a multidimensional matrix to represent the domain and to define a coverage index. This approach overcomes the limitations of using patterns and thresholds, which are sensitive and directly affect the measurements based on the way they are defined. We create a detailed and generic representation of the database coverage that can be used to assess which part of the domain is poorly covered by data and, consequently, to start a Design of Experiments process.

We measure the coverage C of a dataset D that regards the model M with n attributes, A = {A_1, ..., A_n}, in three steps.

First, it is necessary to identify a subset of the model fields (or attributes) {A_1, ..., A_s} = Â ⊆ A and transform them into categorical attributes. A categorical attribute of the model is a field that can only take a value from a restricted number of options. In this way, any attribute A_i ∈ Â can only have d_Ai different ordered categorical values (or possible options). If the attribute A_i ∈ Â is a continuous numeric field, we take the minimum (min) and the maximum (max) value that can be taken by A_i in the domain, fix t equidistant ticks in the range [min, max], and associate the value of the attribute with the closest tick. If, instead, the possible values of an attribute are not continuous but have high cardinality, we can identify a subset of the possible values leveraging a hierarchy among them or using bucketization: similar values are associated with the same bucket [2]. Given an entry r of the model M and an attribute A_i ∈ Â, r has a corresponding vector v_{Ai,r} = (v_{Ai,r}[1], ..., v_{Ai,r}[d_Ai]) for the attribute A_i, where v_{Ai,r}[k] = 1, with k ∈ [1, d_Ai], if r has the k-th categorical value for the attribute A_i, and 0 otherwise. In this way, it is possible to register an array field of the model where an entry can assume multiple categorical values for the same attribute.

Second, we define a multidimensional space that reflects our database's coverage among the chosen attributes Â, with cardinality |Â| = s. Each characteristic A_i ∈ Â defines a dimension of the space of size d_Ai.
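The discretization and coverage bookkeeping described in this section can be sketched as follows. The attribute names, ranges, tick counts, and threshold are illustrative choices, and a sparse dictionary stands in for the multidimensional matrix CM; this is a sketch of the idea, not the system's implementation.

```python
# Sketch of the coverage computation for continuous attributes discretized
# to t equidistant ticks (step 1), collected into a coverage matrix (step 2)
# and reduced to a normalized index by thresholding the cells (step 3).

def ticks(lo, hi, t):
    """t equidistant ticks over the range [lo, hi]."""
    step = (hi - lo) / (t - 1)
    return [lo + k * step for k in range(t)]


def closest_tick_index(value, tick_list):
    """Index of the tick closest to the given attribute value."""
    return min(range(len(tick_list)), key=lambda k: abs(tick_list[k] - value))


def coverage(entries, axes, threshold):
    """entries: list of dicts; axes: {attribute: tick_list}.
    Returns (cm, index): cm maps a cell (one tick index per attribute) to the
    number of entries falling in it; the index is the fraction of cells whose
    count reaches the threshold, over the total number of cells."""
    cm = {}
    for entry in entries:
        cell = tuple(closest_tick_index(entry[a], axes[a]) for a in axes)
        cm[cell] = cm.get(cell, 0) + 1
    total_cells = 1
    for tick_list in axes.values():
        total_cells *= len(tick_list)
    covered = sum(1 for count in cm.values() if count >= threshold)
    return cm, covered / total_cells


# 500-2000 K in steps of 25, 0-40 bar in steps of 10, as in Section 5.
axes = {"temperature": ticks(500, 2000, 61), "pressure": ticks(0, 40, 5)}
exps = [{"temperature": 812, "pressure": 9},
        {"temperature": 820, "pressure": 11}]
cm, c = coverage(exps, axes, threshold=1)
```

A dense multidimensional array would work equally well; the sparse dictionary simply avoids materializing the many empty cells of a poorly covered domain.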
We then create a matrix, called coverage matrix CM, with dimension d_CM = d_A1 × ... × d_As to represent this space.

Finally, after initializing all the matrix cells to 0, for every entry r in the model M and for every possible combination of categorical values of the attributes, we update the coverage matrix using Equation (1) only if the condition in Equation (2) holds for r, with i_m ≠ 0 for m ∈ [1, s]:

CM[i_1, ..., i_s] += 1    (1)

v_{A1,r}[i_1] = ... = v_{As,r}[i_s] = 1    (2)

The final result is a density matrix that represents the coverage of our database with respect to the given categorical attributes. We can then immediately define a database coverage index: after examining all the entries r present in the dataset D, we count the number of cells with a value bigger than a given threshold T and normalize this count by the total number of cells (Equation (3)):

C = ( Σ_{i ∈ [1, d_A1], ..., k ∈ [1, d_As]} 1 if CM[i, ..., k] ≥ T ) / d_CM ∈ [0, 1]    (3)

[Figure 4 example: experiment 1 has metadata Reactor = PFR and Exp. Type = IDT; experiment 2 has Reactor = RCM and IDT Type = d/dt OH. Interpreter 50 has rules (META, Reactor, PFR) and (META, Exp. Type, IDT); interpreter 60 has rules (META, Reactor, RCM) and (META, IDT Type, d/dt OH). The mappings assign, e.g., pressure to the X-axis and temperature to the Y-axis for interpreter 50, and temperature to the X-axis and IDT to the Y-axis for interpreter 60.]

Figure 4: An example of the rule-based interpretation.

4.3 Data cleaning
Data-driven applications are sensitive to data quality, but in domains where the experimental data are rare and affected by non-negligible uncertainty, it is hard to define and measure the quality level of an experiment on which to base the decision to accept or reject its insertion into the repository. As discussed in Section 3, the process that we have identified tries to mitigate three different data quality dimensions: consistency, completeness, and accuracy. The domain-specific automatic checks, for example, ensure consistency, examining that the unit of measurement of a property is valid. In the verification step, instead, the scientist completes the empty mandatory metadata of the experiment. The accuracy of experimental data affected by uncertainty is hard to quantify, but the combination of a human-in-the-loop, the predictive model, and a similarity index can help in this task. The predictive model has its own uncertainty; for this reason, if we use a similarity index that quantifies the difference between the predicted data and the experimental data, we can automatically identify an experiment whose behavior is somewhat different from that of the other similar experiments. It is then the scientist who establishes, case by case, what happened, invalidating the experiment, if necessary, through the status metadata. Once an iteration of the simulation-analysis-cleaning loop is terminated, the cycle can start over, and the attention moves to another experiment. Section 5 presents examples of data cleaning, database coverage, and semantic interpretation.

5 DISCUSSION
The backbone of automation in a scientific data management system is the ability to understand the semantics of an experiment. In our case study, this means distinguishing the x-axis from the y-axes and correctly pairing the experimental properties with the simulated data. Figure 4 shows an example of the assignment of an Interpreter to two experiments based on rules. The interpreter with ID 50 is assigned to the experiment with ID 1: in fact, all the rules specified by this interpreter are fulfilled by the experiment. Then, thanks to the interpreter, we are able to recognize the x-axis and the y-axis of the experimental data.

SciExpeM has a database of about 500 experiments which, as described in Section 4.2, have been categorized based on two metadata attributes suggested by domain experts: temperature and pressure. Specifically, the temperature is tokenized from a minimum of 500 K to a maximum of 2000 K in steps of 25 degrees, while the pressure goes from 0 to 40 bar in steps of 10. The coverage index using threshold 1 is 0.88; if 3 and 5 are the thresholds, the coverage index is 0.55 and 0.32, respectively. Figure 5 shows the density of the coverage matrix CM, which is used to calculate these indices.

Figure 5: A heatmap representing the density of the coverage matrix.

Figure 6: Heatmap visualization of the outlier detection inside the human-in-the-loop process. On the y-axis, different models; on the x-axis, different experiments. The heatmap value depicts the Curve Matching score. (a) Before an iteration of the analysis-improvement loop; in red, a possible outlier. In this case, the data was evaluated by an expert as unreliable. (b) After excluding the unreliable data from the database, we re-analyze the same set of data, highlighting other possible sources of information/errors.

Through the interpreter, SciExpeM can simulate an experiment with different models and compare the results. With model validation, given a domain-specific similarity measure, we measure the predictive performance of a model against a set of experimental data. The analysis of the similarity scores after the model validation provides essential information for the model improvement and can also be used to improve the quality of the repository itself. In fact, we can also use the predictive model capabilities to perform data cleaning. A rule-based approach for data cleaning is already implemented, focused on syntactic or semantic rules on the attributes of the database model, but it is not powerful enough to understand whether the measurements contained inside the experimental data are reliable. We combine the use of the predictive model with automatic statistical investigation tools to detect outliers [12].
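A minimal sketch of this statistical screening, assuming scores from a similarity index such as Curve Matching grouped by category; the two-sided 1.5-standard-deviation threshold is an illustrative choice, not the system's calibrated rule.

```python
# Sketch: within a category of experiments (same portion of the domain), a
# similarity score far from the category mean flags a possible outlier for
# further human inspection.

from statistics import mean, stdev


def flag_outliers(scores, n_sigma=1.5):
    """scores: {experiment_id: similarity score} for one category.
    Returns the ids whose score deviates from the category mean by more
    than n_sigma sample standard deviations."""
    values = list(scores.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all scores identical: nothing stands out
    return [eid for eid, s in scores.items() if abs(s - mu) > n_sigma * sigma]


category_scores = {"exp1": 0.91, "exp2": 0.88, "exp3": 0.90,
                   "exp4": 0.89, "exp5": 0.92, "exp6": 0.20}
print(flag_outliers(category_scores))  # → ['exp6']
```

The flagged experiments are only tagged for inspection, never silently removed: deciding between a model error and unreliable data remains a human task.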
For this task, we leverage the categorization of experiments described in Section 3: it is reasonable to expect that the prediction performance of a model over a set of data belonging to the same category, i.e., the same portion of the domain, is similar. A significant deviation from the average similarity index of a simulation therefore signals a possible outlier. As explained in Section 3, each entry of the database has metadata that specify its status. If an entry is a possible outlier, we automatically tag it with a specific label in its status that alerts the human-in-the-loop that further inspection is required. This procedure verifies whether the model is wrong, providing clues for model improvement, or whether the experimental data are unreliable. In the latter case, the entry status is changed to a specific value, invalid, which excludes the entry from further analysis; the experiment nevertheless remains in the repository to prevent it from being re-entered in the future.

In our case study, we use Curve Matching [4], a similarity index between two curves: one represents the experimental data, the other the data predicted by the model. In Figure 6, it is possible to observe one iteration of the continuous analysis-improvement loop in which both the model and the experimental database can be improved. In this specific case, an unreliable experiment is identified (Figure 6a) and then excluded from the following iterations (Figure 6b) after a deeper analysis by the scientist. In Figure 6b, the heatmap color is rescaled accordingly, to depict that the attention will be on different experiments in the next iteration.

6 CONCLUDING REMARKS

In this work, we have presented the problems and proposed solutions for managing a complex database that represents an experimental domain. As in many cases, creating a scientific repository is not the final goal, but a preliminary step to extract value from the data. For this purpose, we have created a human-in-the-loop process in which the users have different tasks. First, to resolve the heterogeneity of the data, using general metadata as additional model attributes and with the help of the users, we can categorize and distinguish which portion of the domain is precisely represented by the experiments. Second, using a rule-based procedure, we can automatically understand the semantics of experimental data. This information is essential for the subsequent automatic analyses. Finally, as in many validation scenarios, the reliability of the prediction accuracy depends on the coverage of the test set. For this purpose, we develop a general coverage index that, given a set of model attributes that define the domain space, quantifies the domain coverage of the database. Besides, we can combine this information with a statistical investigation: given a similarity measure and human support, we can establish whether an outlier experiment is a source of information for the improvement of the predictive model or unreliable experimental data, thus improving the overall database quality.

ACKNOWLEDGMENTS

The work of E.R. is supported by the interdisciplinarity PhD project of Politecnico di Milano.

REFERENCES

[1] Chris Allan et al. 2012. OMERO: flexible, model-driven data management for experimental biology. Nature Methods 9, 3 (2012), 245–253.
[2] Abolfazl Asudeh, Zhongjun Jin, and HV Jagadish. 2019. Assessing and remedying coverage for a given dataset. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 554–565.
[3] Carlo Batini and Monica Scannapieco. 2016. Data and Information Quality - Dimensions, Principles and Techniques. Springer. https://doi.org/10.1007/978-3-319-24106-7
[4] Mara Sabina Bernardi, Matteo Pelucchi, Alessandro Stagni, Laura Maria Sangalli, Alberto Cuoci, Alessio Frassoldati, Piercesare Secchi, and Tiziano Faravelli. 2016. Curve matching, a generalized framework for models/experiments comparison: An application to n-heptane combustion kinetic mechanisms. Combustion and Flame 168 (2016), 186–203.
[5] Louardi Bradji and Mahmoud Boufaida. 2011. A rule management system for knowledge based data cleaning. Intelligent Information Management 3, 6 (2011).
[6] Kamaljit Chowdhary and Paul Dupuis. 2013. Distinguishing and integrating aleatoric and epistemic variation in uncertainty quantification. ESAIM: Mathematical Modelling and Numerical Analysis 47, 3 (2013), 635–662.
[7] Marina Drosou, HV Jagadish, Evaggelia Pitoura, and Julia Stoyanovich. 2017. Diversity in big data: A review. Big Data 5, 2 (2017), 73–84.
[8] Sandra Geisler, Maria-Esther Vidal, Cinzia Cappiello, Bernadette Farias Lóscio, Avigdor Gal, Matthias Jarke, Maurizio Lenzerini, Paolo Missier, Boris Otto, Elda Paja, Barbara Pernici, and Jakob Rehof. 2021. Knowledge-driven Data Ecosystems Towards Data Transparency. arXiv:2105.09312 [cs.DB]
[9] Zhiqiang Gong, Ping Zhong, and Weidong Hu. 2019. Diversity in machine learning. IEEE Access 7 (2019), 64323–64350.
[10] Jane Greenberg, Hollie C White, Sarah Carrier, and Ryan Scherle. 2009. A metadata best practice for a scientific data repository. Journal of Library Metadata 9, 3-4 (2009), 194–212.
[11] Francesco Guerra, Paolo Sottovia, Matteo Paganelli, and Maurizio Vincini. 2019. Big data integration of heterogeneous data sources: the re-search alps case study. In 2019 IEEE International Congress on Big Data (BigDataCongress). IEEE, 106–110.
[12] Victoria Hodge and Jim Austin. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22, 2 (2004), 85–126.
[13] Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J Franklin, and Ken Goldberg. 2016. ActiveClean: Interactive data cleaning for statistical modeling. Proceedings of the VLDB Endowment 9, 12 (2016).
[14] Victor R Lambert and Richard H West. 2015. Identification, correction, and comparison of detailed kinetic models. In 9th US Natl Combust Meeting, Cincinnati, OH.
[15] Yin Lin, Yifan Guan, Abolfazl Asudeh, and HV Jagadish. 2020. Identifying insufficient data coverage in databases with multiple relations. Proceedings of the VLDB Endowment 13, 12 (2020), 2229–2242.
[16] Luigi Marini et al. 2018. Clowder: Open Source Data Management for Long Tail Data. In Proceedings of the Practice and Experience on Advanced Research Computing (Pittsburgh, PA, USA) (PEARC '18). Association for Computing Machinery.
[17] Carsten Olm, István Gy Zsély, Róbert Pálvölgyi, Tamás Varga, Tibor Nagy, Henry J Curran, and Tamás Turányi. 2014. Comparison of the performance of several recent hydrogen combustion mechanisms. Combustion and Flame 161, 9 (2014), 2219–2234.
[18] Barbara Pernici, Francesca Ratti, and Gabriele Scalia. 2021. About the Quality of Data and Services in Natural Sciences. Springer International Publishing, Cham, 236–248. https://doi.org/10.1007/978-3-030-73203-5_18
[19] Edoardo Ramalli, Gabriele Scalia, Barbara Pernici, Alessandro Stagni, Alberto Cuoci, and Tiziano Faravelli. 2021. Data ecosystems for scientific experiments: managing combustion experiments and simulation analyses in chemical engineering. Accepted for publication in Frontiers in Big Data (2021).
[20] Yuji Roh, Geon Heo, and Steven Euijong Whang. 2021. A survey on data collection for machine learning: a Big Data-AI integration perspective. IEEE Transactions on Knowledge and Data Engineering 33 (2021), 1328–1347.
[21] Gabriele Scalia, Colin A Grambow, Barbara Pernici, Yi-Pei Li, and William H Green. 2020. Evaluating scalable uncertainty estimation methods for deep learning-based molecular property prediction. Journal of Chemical Information and Modeling 60, 6 (2020), 2697–2717.
[22] Gabriele Scalia, Matteo Pelucchi, Alessandro Stagni, Alberto Cuoci, Tiziano Faravelli, and Barbara Pernici. 2019. Towards a scientific data framework to support scientific model development. Data Science 2, 1-2 (2019), 245–273.
[23] Fatimah Sidi, Payam Hassany Shariat Panahy, Lilly Suriani Affendey, Marzanah A Jabar, Hamidah Ibrahim, and Aida Mustapha. 2012. Data quality: A survey of data quality dimensions. In 2012 International Conference on Information Retrieval & Knowledge Management. IEEE, 300–304.
[24] Carol Tenopir, Elizabeth D Dalton, Suzie Allard, Mike Frame, Ivanka Pjesivac, Ben Birch, Danielle Pollock, and Kristina Dorsett. 2015. Changes in data sharing and data reuse practices and perceptions among scientists worldwide. PloS One 10, 8 (2015).
[25] Tamás Varga, T Turányi, E Czinki, T Furtenbacher, and A Császár. 2015. ReSpecTh: a joint reaction kinetics, spectroscopy, and thermochemistry information system. In Proceedings of the 7th European Combustion Meeting, Vol. 30. 1–5.
[26] Bryan W Weber and Kyle E Niemeyer. 2018. ChemKED: A Human- and Machine-Readable Data Standard for Chemical Kinetics Experiments. International Journal of Chemical Kinetics 50, 3 (2018), 135–148.
[27] Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 1 (2016), 1–9.