=Paper=
{{Paper
|id=Vol-2929/paper5
|storemode=property
|title=Know Your Experiments: Interpreting Categories of Experimental Data and Their Coverage
|pdfUrl=https://ceur-ws.org/Vol-2929/paper5.pdf
|volume=Vol-2929
|authors=Edoardo Ramalli,Barbara Pernici
|dblpUrl=https://dblp.org/rec/conf/vldb/RamalliP21
}}
==Know Your Experiments: Interpreting Categories of Experimental Data and Their Coverage==
Edoardo Ramalli, Politecnico di Milano, Milan, Italy (edoardo.ramalli@polimi.it)
Barbara Pernici, Politecnico di Milano, Milan, Italy (barbara.pernici@polimi.it)
ABSTRACT

Data management in scientific domains is more important than ever due to the increasing availability of experimental data. Automatically integrating and managing this information would significantly speed up its reuse and, in particular, the development of predictive models for a given domain. However, the diversity, ambiguity, and complexity of experimental data make this hard in practice. In this work, we propose a general approach to overcome these challenges, combining a human-in-the-loop process with a new methodology to automatically understand the semantics of experimental data, which can also be used as a data cleaning procedure. In addition, we focus on assessing the domain coverage of an experimental database using only categorical characteristics of the domain, which is essential for model validation and for understanding if and where there is a need to perform additional experiments.

Reference Format:
Edoardo Ramalli and Barbara Pernici. Know your experiments: interpreting categories of experimental data and their coverage. In the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data 2021).

Copyright © 2021 for the individual papers by the papers’ authors. Copyright © 2021 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in the Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021, Copenhagen, Denmark) on CEUR-WS.org.

Figure 1: In the plot, a simplified example of the experimental data of interest and the corresponding simulated data of an experiment; in the tables, the tabular data and metadata of an experiment with the simulated data. (The figure contains a plot of concentration [mole fraction] versus temperature [K] and three tables: the experiment metadata, e.g., ID 12, Reactor PFR, Exp. Type O.C.M., T 300 K, P 1 atm, Phi 0.5; the experimental data series for temperature, concentration, and pressure; and the corresponding simulated data series.)

1 INTRODUCTION

The collection of experimental data in many disciplines has produced a massive amount of data over the decades. However, the quality of data and the collection methodologies have changed with the evolution of the research fields and the improvements in the technology used to carry out measurements. Over the years, this progression has led to the availability of considerable amounts of data, which, however, are likely affected by ambiguity problems due to their heterogeneity and complexity.

At the same time, the increasing availability of experimental data has spurred the development of predictive models to study a domain and improve the related technologies. These data-driven models are greedy for data, and, as a consequence, there is the need to automatically collect, store, and manage large quantities of information coming from different sources, representation formats, and quality levels. Data ecosystems address these problems by integrating disparate or incompatible data sources while maintaining a specific quality level [8]. As experimental data are a precious source of value, the FAIR principles encourage the reuse and sharing of data [27]. Nevertheless, their complexity often makes it hard to put these ideas into practice. In fact, in many domains, the data management system has an essential role, so that “the available data management resources define what is discovered” [1], removing the differences and the ambiguity, and making the data usable.
In our case study, combustion kinetics, as shown in a simplified snapshot in Figure 1, we work with the experimental data of an experiment, its associated metadata, and the corresponding simulated data of a predictive model, in order to compare them and validate and improve the model itself. The challenge is to overcome the manual management of the data: we need to automatically interpret the experiment, i.e., to distinguish the actual subject data of the experiment among all the data columns, simulate it with the right solver, and pair the experimental data with the simulated data correctly based on the content of the experiment metadata.

We propose an iterative approach to understand and store experimental data with humans in the loop, focusing on three aspects that are critical in building scientific models: interpreting the scientific data, assessing the coverage of the experiments in a specific domain, and cleaning and improving the scientific repository. To this purpose, we propose a rule-based interpretation of each experiment that enables us to automatically validate and clean the data using a similarity index. Furthermore, it is important to quantify the database coverage within the experimental domain space. The coverage of a domain by an experimental database impacts the ability to assess the validity of the numerical simulation models developed for that domain and, if the database is used as a training set for machine learning models, it can have a heavy impact on the quality of the resulting model [9]. We define a general index to quantify the coverage and the experiments’ density distribution by combining categorical attributes of the database schema and multidimensional matrices.

The paper is structured as follows. In Section 2, we discuss related work and open problems. A general approach to integrate different data sources with semantic heterogeneity problems is introduced in Section 3. Section 4.1 presents a rule-based approach to automatically interpret the semantics of experimental data, and a methodology to quantify the coverage of an experimental database is presented in Section 4.2. An automatic analysis of a numerical simulation model using experimental data that facilitates data cleaning procedures is discussed in Section 4.3. A final discussion is presented in Section 5.

2 RELATED WORK

In recent years there has been growing attention to the sharing and reuse of data [24]. Several projects have been developed, such as EOSC (European Open Science Cloud), focused on reusing, integrating, and sharing data and services within the scientific community. An example is Clowder [16], a framework that facilitates the development of a data management system, offering features for visualization, annotation, and management. Although Clowder has shown that the framework can be used in different domains, each domain has its own characteristics and requires specific implementations that are difficult to generalize. Homer is an example of a system for managing experimental biological data [1]. The heterogeneity of the collected data, represented and managed over time, defines what can be discovered and directly affects the quality and quantity of research results. For this reason, there is a need to integrate and manage complex and heterogeneous scientific data in a system capable of extracting value from them. The integration of experimental data from different sources is not an easy task: correct use of metadata can provide the necessary knowledge for preservation, access, and reuse of scientific data [10] and therefore supports both the immediate development of applications and the long-term maintainability and accessibility of data. In this context, there is the need for a data management system that offers services for integrating heterogeneous sources of information for the case study of combustion kinetics, removing the semantic ambiguity of the data, and providing services to analyze the data and improve the predictive model. In particular, such a system has to analyze together multiple experimental data that compose a trend rather than stand-alone information.

Combustion kinetics has been the subject of study for many decades. For this reason, many experimental data regarding different fuels in several environmental conditions have been collected over the years. The evolution of the combustion study process has led to increasingly precise measurements, enriching the experimental data with outline details that are decisive for a complete understanding of the phenomenon. Over the last few decades, data collection has been more methodical and massive, which has allowed for the development of predictive numerical models. A numerical model can simulate complex domains without the necessity to carry out experiments that are expensive in terms of time and price. In particular, in the case of combustion kinetics, we can predict the behavior of reactors and fuels in different conditions to improve their efficiency and reduce pollutants.

Even today, both the models and the experimental data are affected mainly by two problems that make them sources of heterogeneous information. The first problem regards uncertainty; the second concerns the ambiguity of the information contained therein. Uncertainty can be related to the experimental imprecision or to the error made by the model representing the domain. These two types of uncertainty are thus defined as the aleatoric one, which is related to the noise present in the data, and the epistemic one, associated with what the model does not represent precisely [6]. Similarly, ambiguities can be encountered both in the model and in the experimental data. An example regards chemical names. Many different fields deal with chemical compounds whose names are not uniquely defined, and for this reason, diverse nomenclatures of the same compound can be found in different models and experimental data. This obstacle prevents an immediate integration of experimental data from different sources and a direct comparison of different models [14]. Another characteristic of this domain is that it is hard to automatically understand the experiment subjects among the various information contained in the data.

Regarding the uncertainty in the data and the model, techniques have been developed to separate the two types of uncertainty [21], but it is not easy to estimate them if a ground truth is not available. In combustion kinetics, it has been conventionally chosen to assume arbitrary uncertainty values, if missing, for specific types of experiments and apparatuses [17].

Different formats have been proposed to represent combustion kinetics experimental data and to remove the ambiguity from the experimental data. These formats contain mandatory or optional fields that limit the freedom of each researcher in defining their own fields, thus moving towards a standardization of representation. There are mainly two representation formats in combustion kinetics, ReSpecTh [25] and ChemKED [26].

The diversity of a dataset is a critical aspect in many practical applications, but it is often overlooked [7]. As a result, biased predictors can easily be obtained, which can also have severe repercussions in everyday life [2]. The coverage of a database allows us to understand how diverse a dataset is.
Recent proposals allow quantifying the coverage of a database using recognition patterns concerning categorical attributes [2], which can also be found on different tables [15]. These approaches are based on the definition of patterns and thresholds, and there is, therefore, a need to define both accurately and precisely.

Data is critical for the development of machine-learning-based models. For this reason, data management has an increasingly central role in these activities, as the results of the models are strictly dependent on the dataset [18]. In more recent times, the focus has shifted towards the correct integration and quality of the data, and for this reason, the reverse operation is carried out: the models are used iteratively to evaluate and improve the quality of the data [20]. This data cleaning procedure can improve the starting dataset, paying attention to maintaining the convergence of the machine learning model [13]. Other techniques of data cleaning rely on the definition of rules, where, based on the result of the evaluation of a condition, a specific operation is performed [5].

3 SCIEXPEM

In many experimental disciplines, data is collected from different sources such as repositories, literature, or private communications between research laboratories. This entails having to manage various problems related to the heterogeneity of the data [11]. Furthermore, as in combustion kinetics, there is no uniquely accepted representation standard to convey this information. All this implies, even for the most recent data, different accuracy, completeness, and other data quality dimensions of the repository [23].

Experimental data are precious both for their rarity and for the cost of collecting them. For this reason, it is essential to accept all the experiments and then carry out a series of automatic checks to preserve the repository’s quality. For example, a possible control is on the consistency between the unit of measurement and the measured property. Another quality dimension to guarantee is completeness: since the data comes from different sources, times, and formats, it is essential to ensure that all the primary information of an experiment, in terms of metadata, is complete. Regarding the semantic accuracy of the experimental data, it is important that the values of the properties are within a range of reasonableness. However, while in the literature there has been extensive attention to developing techniques for managing and ensuring data quality and consistency (see [3] for an extensive survey), there are still many open problems in understanding the quality of data in their context of use. In particular, in this paper we focus on using experimental data in simulation model development in general, in a context in which the experimental error can be notoriously significant but is not (or cannot) easily be quantified. In this context, the problem is the ability to identify possible errors in the data and/or in the models, in a joint validation effort based on a data-driven approach. Finally, a crucial aspect for all data-driven applications is automation: in the case of predictive model development, manually managing the simulations and validations of the experiments is a wasteful and error-prone task. The problem is to provide a generic framework able to manage experiments easily and in a domain-independent way, associating them with the information needed for data-driven techniques, such as simulations and predictions.

To tackle these problems, we define the process illustrated in Figure 2, which follows the entire life cycle of experimental data to guarantee a certain level of data quality, according to different quality dimensions, and at the same time provides information to improve the predictive model.

This human-in-the-loop process is implemented within SciExpeM (Scientific Experiments and Models), a framework that offers different services related to the management and analysis of experimental and simulated data to speed up the predictive model development process in combustion kinetics [19, 22]. We associate with the activities in the process additional metadata, status, to assess the validation state of the experimental data: it denotes whether an experiment is new in the database or whether it is invalid or verified.

SciExpeM uses the process for different applications. First of all, the user enters the experimental data in the system using, for example, an interactive form. The experiment is added to the database, and SciExpeM checks for syntax or detectable semantic errors. Initially, the new data are tagged as new, and they can be set to invalid in any of the following phases if flaws in the data are detected. In a second moment (activity Check experiment in Figure 2), an expert has to verify each new experiment, checking for undetectable semantic errors and filling in the incomplete experiment metadata. Once an experiment is verified, the status field changes accordingly, and SciExpeM couples the experiment with an interpreter. Experimental data and results of simulators are records of information that we need to distinguish and pair automatically. To this purpose, we propose to associate experimental data with the concept of an interpreter for the data. This entity, in particular, can distinguish the properties that are under investigation in an experiment from the others that are just auxiliary information, such as environmental conditions. For example, in Figure 1, the pressure is neither the dependent nor the independent variable (or property) under investigation, unlike temperature and concentration. Moreover, based on the experiment details, the interpreter knows which solver needs to be used to simulate the experiment and correctly pairs the experimental data with the corresponding simulated ones. Finally, when the system can manage an experiment independently with its simulations, a loop starts. The simulated data are compared with the experimental data using a similarity index that provides information to improve both the model and the repository quality. This comparison is possible because we leverage a bidirectional relationship: we use the model to validate the data and the data to validate the model. First, using the experimental data to validate the model helps understand which aspects or portions of the domain represented by an experiment still need to be improved. Second, we use the model to understand if the semantics of the experimental data is reliable: a model that differs strongly from the experimental data is synonymous with an error in the model or incorrect experimental data. The human-in-the-loop approach allows assessing these discrepancies and taking the appropriate actions.

4 EXPERIMENTS MANAGEMENT

Representing, collecting, and integrating heterogeneous data in a database are only the initial steps to extract value.
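The validation life cycle described in Section 3, where an experiment enters as new and later becomes verified or invalid, can be sketched as a small state machine. This is our own illustrative sketch: the `Experiment` class, the `check` and `verify` helpers, and their logic are hypothetical and do not reflect SciExpeM's actual API.

```python
# Illustrative sketch (not SciExpeM's API) of the experiment status
# life cycle: new -> verified, or new/any phase -> invalid.
from dataclasses import dataclass

NEW, VERIFIED, INVALID = "new", "verified", "invalid"

@dataclass
class Experiment:
    metadata: dict
    data: dict
    status: str = NEW  # every experiment enters the database as "new"

def check(exp: Experiment) -> None:
    """Automatic step: syntax or detectable semantic errors invalidate the entry."""
    if not exp.metadata or not exp.data:
        exp.status = INVALID

def verify(exp: Experiment, expert_approves: bool) -> None:
    """Manual step: an expert completes the metadata and confirms the experiment."""
    if exp.status == NEW:
        exp.status = VERIFIED if expert_approves else INVALID

exp = Experiment(metadata={"Reactor": "PFR"}, data={"temperature": [800, 827]})
check(exp)
verify(exp, expert_approves=True)
print(exp.status)  # verified
```

Only after an experiment reaches the verified state would the system couple it with an interpreter and enter the simulation loop.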
Figure 2: A simplified schema of the experimental data process. (The pipeline is: Add Experiment → Check Experiment → Verify Experiment → Interpret Experiment → Simulate Experiment → Analyze Simulation → Improve Model and/or Data Cleaning.)

In Section 4.1, we present our approach to correctly interpreting the semantics of the data; in Section 4.2, we measure the coverage of a database in a given domain; and in Section 4.3, we focus on improving the repository quality.

4.1 Rule-based Automatic Interpretation

Experiments are records of measured properties and other metadata that characterize them. Besides, among the measurements, it is not rare to find additional measured properties that specify, for example, the environmental conditions of the measures, without being the subject of the scientific observation. This peculiarity generates ambiguity, since a property could be the subject in one experiment but not in another. In practice, to manage scientific data, there is the need to automatically distinguish which, among the measured properties, are the dependent and the independent variables. In this context, we need to teach the data management system to recognize the role of each property in each experiment, keeping in mind that what makes a property a subject of an experiment is a particular combination of metadata values of the experiment itself. For this reason, it is necessary to define a flexible methodology to distinguish the subject properties from the auxiliary ones. In other words, we need to find an approach to transfer the domain knowledge into SciExpeM to interpret the semantics of an experiment correctly and treat all the database entries with equal semantics in the same way.

Manual management of this complex database is not feasible because an experiment could contain dozens of measured properties, and, for example, we would have to tag each of them correctly as a subject of the experiment or not. Moreover, this procedure would have to be repeated hundreds of times, once for each experiment, making it hard to analyze a large amount of data. Accordingly, we propose a methodology that automatically extracts useful information from a database model in which semantic heterogeneity is present.

We propose a dynamic interpretation of a database model based on rules, similar to what is done for data cleaning or to ensure consistency and accuracy in a database [5]. In Figure 3 we show the class schema of the database that we use to implement the automatic interpretation of scientific experiments. Given a model Experiment (Exp.), E, that is an abstract representation of a model affected by ambiguity, we have to assign, to each entry e ∈ E, an Interpreter entry of the model I. This model can save additional meta-information that could be useful for other tasks. For example, in this schema, the interpreter knows which precise solver we need to use to simulate an experiment. Each interpreter knows how to distinguish the primary data from the secondary information and correctly map them. This is possible because the interpreter has multiple references M = {m_1, ..., m_n} to a mapping model that knows, for example, the correct relation of dependent and independent variables or, more in general, can separate the useful information from the secondary one and, if necessary, pair them. In order to associate an interpreter with an entry of the model E, we have to associate a set of rules, R = {r_1, ..., r_k}, with the interpreter. These rules r are entries of another table in the database, rule, where each element specifies a name of the model N, the attribute's name A, and a value V. A rule r is fulfilled by an entry e ∈ E if A is an attribute of e and the corresponding value of the attribute is equal to V. The model name N is an optional field that, if defined, specifies that the rule is not directly on an attribute of the entry e, but is related to an attribute of another model N that has a reference to the entry e. If an entry e fulfills all the rules r associated with an interpreter i ∈ I, we can associate the interpreter i with the entry e.

4.2 Database Coverage

The Model Validation procedure systematically measures how good the predictions of a model are compared to the corresponding experimental data. To consider the result of this procedure reliable, the experimental database should, if possible, cover the domain as much as possible and with equal granularity. Database coverage can help in this task, providing an immediate procedure to measure the diversity and completeness of representation of the database.

We leverage categorical attributes and a multidimensional matrix to represent the domain and to define a coverage index. This approach overcomes the limitations of using patterns and thresholds, which are sensitive to, and directly affect, the measurements based on the way they are defined. We create a detailed and generic representation of the database coverage that can be used to assess which part of the domain is poorly covered by data and, consequently, can be used to start a Design of Experiments process.

We measure the coverage C of a dataset D that regards the model M with n attributes, A = {A_1, ..., A_n}, in three steps.

First, it is necessary to identify a subset of the model fields (or attributes) {A_1, ..., A_s} = Â ⊆ A and transform them into categorical attributes. A categorical attribute of the model is a field that can only take a value from a restricted number of options. In this way, any attribute A_i ∈ Â can only have d_{A_i} different ordered categorical values (or possible options). If the attribute A_i ∈ Â is a continuous numeric field, we take the minimum (min) and the maximum (max) value that can be taken by A_i in the domain, fix t equidistant ticks in the range [min, max], and associate the value of the attribute with the closest tick.
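The discretization of a continuous attribute described above can be sketched as follows. The helper names `ticks` and `closest_tick_index` are our own hypothetical ones, and the 61 temperature ticks mirror the setup later reported in Section 5 (500 K to 2000 K in steps of 25).

```python
# Sketch of step 1 of the coverage computation: snap a continuous
# attribute value to the closest of t equidistant ticks in [lo, hi].
# Helper names are ours, not the paper's implementation.
def ticks(lo: float, hi: float, t: int) -> list:
    """t equidistant ticks covering [lo, hi]."""
    step = (hi - lo) / (t - 1)
    return [lo + k * step for k in range(t)]

def closest_tick_index(value: float, tick_values: list) -> int:
    """Index of the tick closest to the given value."""
    return min(range(len(tick_values)), key=lambda k: abs(tick_values[k] - value))

# Temperature discretized from 500 K to 2000 K in steps of 25 K, i.e. 61 ticks.
temp_ticks = ticks(500.0, 2000.0, 61)
print(temp_ticks[closest_tick_index(812.0, temp_ticks)])  # 800.0
```

The index returned by `closest_tick_index` is what identifies the cell of the coverage matrix along that attribute's dimension.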
Figure 3: The class model used to represent the domain knowledge and interpret correctly the semantics of the experiments. (The schema contains the tables DATA (EXP ID, DATA NAME, VALUE), EXPERIMENT (EXP ID, INTER. ID), METADATA (NAME, VALUE, EXP ID), RULE (MODEL, NAME, VALUE, INTER. ID), INTERPRETER (INTER. ID, SOLVER), and MAPPING (INTER. ID, TYPE, DATA NAME).)
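As a rough sketch of the rule matching that the class model in Figure 3 supports, an interpreter can be assigned to an experiment when all of its rules are fulfilled. The dictionary-based structures and function names below are our own assumptions, not the paper's implementation; the sample rules follow the Figure 4 example.

```python
# Sketch (our own structures) of rule-based interpreter assignment:
# a rule (model, name, value) holds when the referenced model of the
# experiment (e.g. its META table) has that attribute with that value;
# an interpreter is assigned when all of its rules hold.
def rule_fulfilled(rule: dict, experiment: dict) -> bool:
    model, name, value = rule["model"], rule["name"], rule["value"]
    # If the rule names a related model (e.g. "META"), look there;
    # otherwise fall back to the experiment's own attributes.
    attributes = experiment.get(model, experiment)
    return attributes.get(name) == value

def assign_interpreter(experiment: dict, interpreters: dict):
    """Return the ID of the first interpreter whose rules all hold, else None."""
    for interpreter_id, rules in interpreters.items():
        if all(rule_fulfilled(r, experiment) for r in rules):
            return interpreter_id
    return None

# Rules of interpreter 50 from the Figure 4 example.
interpreters = {
    50: [{"model": "META", "name": "Reactor", "value": "PFR"},
         {"model": "META", "name": "Exp. Type", "value": "IDT"}],
}
exp = {"META": {"Reactor": "PFR", "Exp. Type": "IDT"}}
print(assign_interpreter(exp, interpreters))  # 50
```

Once the interpreter is assigned, its MAPPING entries identify which data series is the x-axis and which the y-axis, and its SOLVER field selects the simulator.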
Instead, if the possible values of an attribute are not continuous but have a high cardinality, we can identify a subset of the possible values by leveraging a hierarchy among them or by using bucketization: similar values are associated with the same bucket [2]. Given an entry r of the model M regarding an attribute A_i ∈ Â, it has a corresponding value v_{A_i,r} = (v_{1,i}, ..., v_{d_{A_i},i}) for the attribute A_i, where v_{j,i} = 1 if r has the corresponding categorical value for the attribute A_i, and 0 otherwise. In this way, it is possible to register an array field of the model where an entry can assume multiple categorical values for the same attribute. We use the notation v_{A_i,r}[k] to denote the k-th value of the attribute A_i, with k ∈ [1, d_{A_i}], for the entry r.

Second, we define a multidimensional space that reflects our database's coverage among the Â set of attributes, with cardinality |Â| = s. Each characteristic A_i ∈ Â defines a dimension of the space, of size d_{A_i}. We then create a matrix, called the coverage matrix CM, with dimension d_CM = d_{A_1} × ... × d_{A_s}, to represent this space.

Finally, after initializing all the matrix cells to 0, for every entry r in the model M and for every possible combination of categorical values of the attributes, we update the coverage matrix using Equation (1), only if the condition in Equation (2) holds for r, with i_m ≠ 0 for m ∈ [1, s]:

CM[i_1, ..., i_s] += 1    (1)

v_{A_1,r}[i_1] == ... == v_{A_s,r}[i_s] == 1    (2)

The final result is a density matrix that represents the coverage of our database with respect to some given categorical attributes. We can then immediately define a database coverage index: after examining all the entries r present in the dataset D, we count the number of cells with a value bigger than a given threshold T and normalize this value by the total number of cells (Equation (3)):

C = ( Σ_{i ∈ [1, d_{A_1}], ..., k ∈ [1, d_{A_s}]} 1{ CM[i, ..., k] ≥ T } ) / d_CM ∈ [0, 1]    (3)

Figure 4: An example of the rule-based interpretation. (Experiment 1 has metadata Reactor = PFR and Exp. Type = IDT, with data series temperature [1, 2, 3] and pressure [10, 20, 30]; experiment 2 has metadata Reactor = RCM and IDT Type = d/dt OH, with data series temperature [4, 5, 6] and IDT [11, 22, 33]. Interpreter 50, whose rules are META Reactor = PFR and META Exp. Type = IDT, maps the x-axis to pressure and the y-axis to temperature; interpreter 60, whose rules are META Reactor = RCM and META IDT Type = d/dt OH, maps the x-axis to temperature and the y-axis to IDT.)

4.3 Data cleaning

Data-driven applications are sensitive to data quality, but in domains where the experimental data are rare and affected by non-negligible uncertainty, it is hard to define and measure a quality level of an experiment on which to base the acceptance or rejection of its insertion in the repository. As discussed in Section 3, the process that we have identified tries to mitigate three different data quality dimensions: consistency, completeness, and accuracy. The domain-specific automatic checks, for example, ensure consistency, examining that the unit of measurement of a property is valid. Instead, in the verification step, the scientist completes the empty mandatory metadata of the experiment. The accuracy of experimental data affected by uncertainty is hard to quantify, but the combination of a human-in-the-loop, the predictive model, and a similarity index can help in this task. The predictive model has its own uncertainty; for this reason, if we use a similarity index that quantifies the difference between the predicted data and the experimental data, we can automatically identify an experiment that has a behavior somewhat different from the other similar experiments. It will then be the scientist who establishes what happened case by case, invalidating the experiment, if necessary, through the status metadata. Once an iteration of the simulation-analysis-cleaning loop is terminated, the cycle can start over, and the attention moves to another experiment. Section 5 presents examples of data cleaning, database coverage, and semantic interpretation.

5 DISCUSSION

The backbone of automation in a scientific data management system is the ability to understand the semantics of an experiment. In our case study, this means distinguishing the x-axis from the y-axes and correctly pairing the experimental properties with the simulated data. In Figure 4 there is an example of the assignment of an Interpreter to two experiments based on rules. The interpreter with ID 50 is assigned to the experiment with ID 1; in fact, all the rules specified by this interpreter are fulfilled by the experiment. Then, thanks to the interpreter, we are able to recognize the x-axis and the y-axis of the experimental data.

SciExpeM has a database of about 500 experiments, which, as described in Section 4.2, have been categorized based on two metadata suggested by domain experts: temperature and pressure. Specifically, the temperature is discretized from a minimum of 500 K to a maximum of 2000 K in steps of 25 degrees, while the pressure goes from 0 to 40 bar in steps of 10. The coverage index using 1 as the threshold is 0.88.
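Equations (1) and (3) can be sketched as follows. For simplicity, each entry contributes a single, already discretized cell index, so the multi-valued case of Equation (2) is omitted; the grid size mimics the 61 × 5 setup of this section, but the entries are toy data, so the resulting index is illustrative only, not the 0.88 reported above.

```python
# Sketch of the coverage matrix (Equation 1) and coverage index
# (Equation 3). Entries are pre-discretized index tuples; structures
# and names are ours, not the paper's implementation.
from itertools import product

def coverage(entries, dims, threshold):
    """entries: list of index tuples (one per entry r);
    dims: (d_A1, ..., d_As); returns (CM, coverage index C in [0, 1])."""
    cm = {}
    for idx in entries:            # Equation (1): CM[i1, ..., is] += 1
        cm[idx] = cm.get(idx, 0) + 1
    total_cells = 1
    for d in dims:
        total_cells *= d           # d_CM = d_A1 * ... * d_As
    covered = sum(1 for idx in product(*(range(d) for d in dims))
                  if cm.get(idx, 0) >= threshold)  # Equation (3) numerator
    return cm, covered / total_cells

# 61 temperature ticks (500..2000 K, step 25) x 5 pressure ticks (0..40 bar, step 10)
entries = [(12, 0), (12, 0), (12, 1), (40, 4)]
cm, c = coverage(entries, dims=(61, 5), threshold=1)
print(cm[(12, 0)], round(c, 4))  # 2 0.0098
```

Raising the threshold only counts cells that are densely populated, which is how the 0.88 / 0.55 / 0.32 indices of this section differ.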
(a) Before an iteration of the analysis-improvement loop. In
red a possible outlier. In this case the data was evaluated by
an expert as unreliable.
Figure 5: An heatmap representing the density of the cover-
age matrix.
and 0.32, respectively. Figure 5 shows the density of the coverage
matrix CM , which is used to calculate these indices.
Through the interpreter, SciExpeM can simulate an experiment
with different models and compare the results. (b) After excluding the unreliable data from the database, we
With model validation, given a domain-specific similarity mea- re-analyze the same set of data, highlighting other possible
sure, we measure the predictive performance of a model against sources of information/errors.
a set of experimental data. The analyses of the similarity scores
after the model validation provide essential information for the Figure 6: Heatmap visualization of the outlier detection in-
model improvement and can also be used to improve the quality of side the human-in-the-loop process. On the y-axis different
the repository itself. In fact, we can also use the predictive model models, on the x-axis different experiments. The heatmap
capabilities to perform data cleaning. A rule-based approach for value depicts the Curve Matching score.
data cleaning is already implemented, and it is focused on syntax
or semantic rules on attributes of the database model, but it is not
powerful enough to understand if the measurements contained
inside the experimental data are reliable. heatmap color is rescaled accordingly, to depict that the attention
We combine the use of the predictive model with automatic statistical investigation tools to detect outliers [12]. For this task, we leverage the categorization of experiments described in Section 3: it is reasonable to expect that the prediction performance of a model over a set of data belonging to the same category, i.e., the same portion of the domain, is similar. A significant deviation from the average similarity index of a simulation is therefore a warning sign of a possible outlier. As noted in Section 3, each entry of the database has metadata that specify its status. If an entry is a possible outlier, we automatically tag it with a specific label in its status, which alerts the human-in-the-loop that further inspection is required. This procedure either verifies that the model is wrong, providing clues for model improvement, or assesses the unreliability of the experimental data. In the latter case, the entry status is changed to a specific value, invalid, which excludes the entry from further analyses; the experiment must nevertheless remain in the repository to prevent its re-entry in the future.
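The category-based check can be sketched as follows. This is an illustrative stand-in, not the system's actual procedure: it flags an entry whose similarity score deviates from its category mean by more than 1.5 sample standard deviations, and the field names and threshold are assumptions.

```python
from statistics import mean, stdev

def flag_outliers(entries, k=1.5):
    """Tag entries whose similarity score deviates from their category mean.

    `entries` is a list of dicts with 'category', 'score', and 'status' keys;
    entries flagged here await inspection by the human-in-the-loop.
    """
    by_cat = {}
    for e in entries:
        by_cat.setdefault(e["category"], []).append(e)
    for cat_entries in by_cat.values():
        scores = [e["score"] for e in cat_entries]
        if len(scores) < 3:
            continue  # too few samples to estimate a spread
        mu, sigma = mean(scores), stdev(scores)
        for e in cat_entries:
            if sigma > 0 and abs(e["score"] - mu) > k * sigma:
                e["status"] = "possible_outlier"
    return entries

data = [{"category": "OCM", "score": s, "status": "valid"}
        for s in (0.91, 0.88, 0.90, 0.35, 0.89)]
flag_outliers(data)
print([e["status"] for e in data])  # the 0.35 entry is flagged
```

In the system described in the text, a flagged entry is not discarded automatically: the status label only routes it to the scientist, who decides whether the deviation points at a model weakness or at unreliable data.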
In our case study, we use Curve Matching [4], a similarity index of two curves: one is the experimental data, the other is the data predicted by the model. In Figure 6, it is possible to observe one iteration of the continuous analysis-improvement loop in which both the model and the experimental database can be improved. In this specific case, an unreliable experiment is identified (Figure 6a) and then excluded from the following iterations (Figure 6b) after a deeper analysis by the scientist. In Figure 6b, the heatmap color is rescaled accordingly, to depict that the attention will be on different experiments in the next iteration.

6 CONCLUDING REMARKS

In this work, we have presented the problems and proposed solutions for managing a complex database that represents an experimental domain. As in many cases, creating a scientific repository is not the final goal; it is preliminary to extracting value from the data. For this purpose, we have created a human-in-the-loop process in which the users have different tasks. First, to address the heterogeneity of the data, using general metadata as additional model attributes and with the help of the users, we can categorize and distinguish which portion of the domain is precisely represented by the experiments. Second, using a rule-based procedure, we can automatically understand the semantics of experimental data. This information is essential for the subsequent automatic analyses. Finally, as in many validation scenarios, the reliability of the prediction accuracy depends on the coverage of the test set. For this purpose, we develop a general coverage index that, given a set of model attributes that define the domain space, quantifies the domain coverage of the database. In addition, we can combine this information with a statistical investigation. Given a similarity measure and human support, we can establish whether an experiment outlier is a source of information for predictive model improvement or unreliable experimental data, thus improving the overall database quality.
ACKNOWLEDGMENTS
The work of E.R. is supported by the interdisciplinarity PhD project of Politecnico di Milano.

REFERENCES
[1] Chris Allan et al. 2012. OMERO: flexible, model-driven data management for experimental biology. Nature Methods 9, 3 (2012), 245–253.
[2] Abolfazl Asudeh, Zhongjun Jin, and HV Jagadish. 2019. Assessing and remedying coverage for a given dataset. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 554–565.
[3] Carlo Batini and Monica Scannapieco. 2016. Data and Information Quality - Dimensions, Principles and Techniques. Springer. https://doi.org/10.1007/978-3-319-24106-7
[4] Mara Sabina Bernardi, Matteo Pelucchi, Alessandro Stagni, Laura Maria Sangalli,
Alberto Cuoci, Alessio Frassoldati, Piercesare Secchi, and Tiziano Faravelli. 2016.
Curve matching, a generalized framework for models/experiments comparison:
An application to n-heptane combustion kinetic mechanisms. Combustion and
Flame 168 (2016), 186–203.
[5] Louardi Bradji and Mahmoud Boufaida. 2011. A rule management system for
knowledge based data cleaning. Intelligent Information Management 3, 6 (2011).
[6] Kamaljit Chowdhary and Paul Dupuis. 2013. Distinguishing and integrating
aleatoric and epistemic variation in uncertainty quantification. ESAIM: Mathe-
matical Modelling and Numerical Analysis 47, 3 (2013), 635–662.
[7] Marina Drosou, HV Jagadish, Evaggelia Pitoura, and Julia Stoyanovich. 2017.
Diversity in big data: A review. Big data 5, 2 (2017), 73–84.
[8] Sandra Geisler, Maria-Esther Vidal, Cinzia Cappiello, Bernadette Farias Lóscio,
Avigdor Gal, Matthias Jarke, Maurizio Lenzerini, Paolo Missier, Boris Otto, Elda
Paja, Barbara Pernici, and Jakob Rehof. 2021. Knowledge-driven Data Ecosystems
Towards Data Transparency. arXiv:2105.09312 [cs.DB]
[9] Zhiqiang Gong, Ping Zhong, and Weidong Hu. 2019. Diversity in machine
learning. IEEE Access 7 (2019), 64323–64350.
[10] Jane Greenberg, Hollie C White, Sarah Carrier, and Ryan Scherle. 2009. A meta-
data best practice for a scientific data repository. Journal of Library Metadata 9,
3-4 (2009), 194–212.
[11] Francesco Guerra, Paolo Sottovia, Matteo Paganelli, and Maurizio Vincini. 2019.
Big data integration of heterogeneous data sources: the re-search alps case study.
In 2019 IEEE International Congress on Big Data (BigDataCongress). IEEE, 106–110.
[12] Victoria Hodge and Jim Austin. 2004. A survey of outlier detection methodologies.
Artificial intelligence review 22, 2 (2004), 85–126.
[13] Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J Franklin, and Ken Gold-
berg. 2016. ActiveClean: Interactive data cleaning for statistical modeling. Pro-
ceedings of the VLDB Endowment 9, 12 (2016).
[14] Victor R Lambert and Richard H West. 2015. Identification, correction, and
comparison of detailed kinetic models. In 9th US Natl Combust Meeting, Cincinnati,
OH.
[15] Yin Lin, Yifan Guan, Abolfazl Asudeh, and HV Jagadish. 2020. Identifying insuffi-
cient data coverage in databases with multiple relations. Proceedings of the VLDB
Endowment 13, 12 (2020), 2229–2242.
[16] Luigi Marini et al. 2018. Clowder: Open Source Data Management for Long Tail
Data. In Proceedings of the Practice and Experience on Advanced Research Comput-
ing (Pittsburgh, PA, USA) (PEARC ’18). Association for Computing Machinery.
[17] Carsten Olm, István Gy Zsély, Róbert Pálvölgyi, Tamás Varga, Tibor Nagy, Henry J
Curran, and Tamás Turányi. 2014. Comparison of the performance of several
recent hydrogen combustion mechanisms. Combustion and Flame 161, 9 (2014),
2219–2234.
[18] Barbara Pernici, Francesca Ratti, and Gabriele Scalia. 2021. About the Quality of
Data and Services in Natural Sciences. Springer International Publishing, Cham,
236–248. https://doi.org/10.1007/978-3-030-73203-5_18
[19] Edoardo Ramalli, Gabriele Scalia, Barbara Pernici, Alessandro Stagni, Alberto
Cuoci, and Tiziano Faravelli. 2021. Data ecosystems for scientific experiments:
managing combustion experiments and simulation analyses in chemical engi-
neering. Accepted for publication on Frontiers in Big Data (2021).
[20] Yuji Roh, Geon Heo, and Steven Euijong Whang. 2021. A survey on data collection
for machine learning: a Big Data-AI integration perspective. IEEE Transactions
on Knowledge and Data Engineering 33 (2021), 1328–1347.
[21] Gabriele Scalia, Colin A Grambow, Barbara Pernici, Yi-Pei Li, and William H
Green. 2020. Evaluating scalable uncertainty estimation methods for deep
learning-based molecular property prediction. Journal of chemical information
and modeling 60, 6 (2020), 2697–2717.
[22] Gabriele Scalia, Matteo Pelucchi, Alessandro Stagni, Alberto Cuoci, Tiziano
Faravelli, and Barbara Pernici. 2019. Towards a scientific data framework to
support scientific model development. Data Science 2, 1-2 (2019), 245–273.
[23] Fatimah Sidi, Payam Hassany Shariat Panahy, Lilly Suriani Affendey, Marzanah A Jabar, Hamidah Ibrahim, and Aida Mustapha. 2012. Data quality: A survey of data quality dimensions. In 2012 International Conference on Information Retrieval & Knowledge Management. IEEE, 300–304.
[24] Carol Tenopir, Elizabeth D Dalton, Suzie Allard, Mike Frame, Ivanka Pjesivac, Ben Birch, Danielle Pollock, and Kristina Dorsett. 2015. Changes in data sharing and data reuse practices and perceptions among scientists worldwide. PloS one 10, 8 (2015).
[25] Tamás Varga, T Turányi, E Czinki, T Furtenbacher, and A Császár. 2015. ReSpecTh: a joint reaction kinetics, spectroscopy, and thermochemistry information system. In Proceedings of the 7th European Combustion Meeting, Vol. 30. 1–5.
[26] Bryan W Weber and Kyle E Niemeyer. 2018. ChemKED: A Human- and Machine-Readable Data Standard for Chemical Kinetics Experiments. International Journal of Chemical Kinetics 50, 3 (2018), 135–148.
[27] Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 1 (2016), 1–9.