Know your experiments: interpreting categories of experimental data and their coverage

Edoardo Ramalli, Politecnico di Milano, Milan, Italy — edoardo.ramalli@polimi.it
Barbara Pernici, Politecnico di Milano, Milan, Italy — barbara.pernici@polimi.it

ABSTRACT
Data management in scientific domains is more important than ever due to the increasing availability of experimental data. Automatically integrating and managing the information would significantly speed up its reuse and, in particular, the development of predictive models for a given domain. However, the diversity, ambiguity, and complexity of experimental data make this hard in practice. In this work, we propose a general approach to overcome these challenges, combining a human-in-the-loop process with a new methodology to automatically understand the semantics of experimental data, which can also be used as a data cleaning procedure. In addition, we focus on assessing the domain coverage of an experimental database using only categorical characteristics of the domain, which is essential for model validation and for understanding if and where there is a need to perform additional experiments.

Reference Format: Edoardo Ramalli and Barbara Pernici. Know your experiments: interpreting categories of experimental data and their coverage. In the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data 2021).

[Figure 1 (tables): experiment metadata (ID 12, Reactor PFR, Exp. Type O.C.M, T 300 K, P 1 atm, Phi 0.5, ...); experimental data series (Temperature 800, 827, 855, 883, ...; Concentration 2E-04, 2E-04, 3E-04, 3E-03, ...; Pressure 1.0, 1.1, 1.3, 1.2, ...); simulated data series (Temperature 800, 827, 855, 883, ...; Concentration 0, 0, 0, 4E-04, ...; Rho 0.9, 0.9, 0.8, 0.6, ...).]

1 INTRODUCTION
The collection of experimental data in many disciplines has produced a massive amount of data over the decades.
However, the quality of the data and the collection methodologies have changed with the evolution of the research fields and the improvements in the technology used to carry out measurements. Over the years, this progression has led to the availability of considerable amounts of data, which are, however, likely affected by ambiguity problems due to their heterogeneity and complexity.

[Figure 1 (plot): Concentration [Mole Fraction] versus Temperature [K] for the experimental and the simulated data.]

Figure 1: In the plot, a simplified example of the experimental data of interest and the corresponding simulated data of an experiment. In the tables, the tabular data and metadata of an experiment with the simulated data.

Copyright © 2021 for the individual papers by the papers' authors. Copyright © 2021 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in the Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021, Copenhagen, Denmark) on CEUR-WS.org.

At the same time, the increasing availability of experimental data has spurred the development of predictive models to study a domain and improve the related technologies. These data-driven models are hungry for data and, as a consequence, there is the need to automatically collect, store, and manage large quantities of information coming from different sources, in different representation formats, and at different quality levels. Data ecosystems address these problems by integrating disparate or incompatible data sources while maintaining a specific quality level [8]. As experimental data are a precious source of value, the FAIR principles encourage the reuse and sharing of data [27]. Nevertheless, their complexity often makes it hard to put these ideas into practice. In fact, in many domains the data management system has such an essential role that "the available data management resources define what is discovered" [1], removing the differences and the ambiguity and making the data usable.

In our case study, combustion kinetics, shown in a simplified snapshot in Figure 1, we work with the experimental data of an experiment, its associated metadata, and the corresponding simulated data of a predictive model, with the goal of comparing them to validate and improve the model itself. The challenge is to overcome the manual management of the data: we need to automatically interpret the experiment, i.e., distinguish the actual subject data of the experiment among all the data columns, simulate it with the right solver, and correctly pair the experimental data with the simulated data based on the content of the experiment metadata.

Combustion kinetics has been the subject of study for many decades. For this reason, many experimental data regarding different fuels in several environmental conditions have been collected over the years. The evolution of the combustion study process has led to increasingly precise measurements, enriching the experimental data with outline details that are decisive for a complete understanding of the phenomenon. Over the last few decades, data collection has become more methodical and massive, which has allowed for the development of predictive numerical models. A numerical model can simulate complex domains without the necessity of carrying out experiments that are expensive in terms of time and price. In particular, in the case of combustion kinetics, we can predict the behavior of reactors and fuels in different conditions to improve their efficiency and reduce pollutants.

Even today, both the models and the experimental data are mainly affected by two problems that make them sources of heterogeneous information. The first problem regards uncertainty; the second concerns the ambiguity of the information contained therein. Uncertainty can be related to the experimental imprecision or to the error made by the model representing the domain. These two types of uncertainty are defined as the aleatoric one, which is related to the noise present in the data, and the epistemic one, associated with what the model does not represent precisely [6]. Similarly, ambiguities can be encountered both in the model and in the experimental data. An example regards chemical names: many different fields deal with chemical compounds whose names are not uniquely defined, and for this reason diverse nomenclatures of the same compound can be found both in different models and in experimental data. This obstacle prevents an immediate integration of experimental data from different sources and a direct comparison of different models [14]. Another characteristic of this domain is that it is hard to automatically identify the experiment subjects among the various information contained in the data.

We propose an iterative approach to understand and store experimental data with humans-in-the-loop, focusing on three aspects that are critical in building scientific models: interpreting the scientific data, assessing the coverage of the experiments in a specific domain, and cleaning and improving the scientific repository. To this purpose, we propose a rule-based interpretation of each experiment that enables automatically validating and cleaning the data using a similarity index. Furthermore, it is important to quantify the database coverage of the experimental domain space. The coverage of a domain by an experimental database affects the ability to assess the validity of numerical simulation models developed for that domain and, if the database is used as a training set for machine learning models, it can have a heavy impact on the quality of the resulting model [9]. We define a general index that quantifies the coverage and the experiments' density distribution by combining categorical attributes of the database schema with multidimensional matrices.

The paper is structured as follows. In Section 2, we discuss related work and open problems. A general approach to integrate different data sources with semantic heterogeneity problems is introduced in Section 3. Section 4.1 presents a rule-based approach to automatically interpret the semantics of experimental data, and a methodology to quantify the coverage of an experimental database is presented in Section 4.2. An automatic analysis of a numerical simulation model using experimental data, which facilitates data cleaning procedures, is discussed in Section 4.3. A final discussion is presented in Section 5.

2 RELATED WORK
In recent years, there has been growing attention to the sharing and reuse of data [24]. Several projects have been developed, such as EOSC (European Open Science Cloud), focused on reusing, integrating, and sharing data and services within the scientific community. An example is Clowder [16], a framework that facilitates the development of a data management system, offering features for visualization, annotation, and management. Although Clowder has shown that the framework can be used in different domains, each domain has its own characteristics and requires specific implementations that are difficult to generalize. Homer is an example of a system for managing experimental biological data [1]. The heterogeneity of the collected data, represented and managed over time, defines what can be discovered and directly affects the quality and quantity of research results. For this reason, there is a need to integrate and manage complex and heterogeneous scientific data in a system capable of extracting value from them. The integration of experimental data from different sources is not an easy task: correct use of metadata can provide the necessary knowledge for the preservation, access, and reuse of scientific data [10] and therefore supports both the immediate development of applications and the long-term maintainability and accessibility of data. In this context, there is the need for a data management system that offers services for integrating heterogeneous sources of information for the case study of combustion kinetics, removing the semantic ambiguity of the data, and provides services to analyze the data and improve the predictive model. In particular, such a system has to analyze together multiple experimental data that compose a trend rather than stand-alone pieces of information.

Regarding the uncertainty in the data and in the model, techniques have been developed to separate the two types of uncertainty [21], but it is not easy to estimate them if a ground truth is not available. In combustion kinetics, it has been conventionally chosen to assume arbitrary uncertainty values, if missing, for specific types of experiments and apparatuses [17].

Different formats have been proposed to represent combustion kinetics experimental data and remove their ambiguity. These formats contain mandatory or optional fields that limit the freedom of each researcher in defining their own fields, thus moving towards a standardization of the representation. There are mainly two representation formats in combustion kinetics, ReSpecTh [25] and ChemKED [26].

The diversity of a dataset is a critical aspect in many practical applications, but it is often overlooked [7]. As a result, biased predictors can easily be obtained, which can also have severe repercussions in everyday life [2]. The coverage of a database allows us to understand how diverse a dataset is. Recent proposals allow quantifying the coverage of a database using recognition patterns over categorical attributes [2], which can also be found on different tables [15]. These approaches are based on the definition of patterns and thresholds; there is, therefore, a need to accurately and precisely define both.

Data is critical for the development of machine learning-based models. For this reason, data management has an increasingly central role in these activities, as the results of the models are strictly dependent on the dataset [18]. In more recent times, the focus has shifted towards the correct integration and quality of the data, and for this reason the reverse operation is carried out: the models are used iteratively to evaluate and improve the quality of the data [20]. This data cleaning procedure can improve the starting dataset, paying attention to maintaining the convergence of the machine learning model [13]. Other techniques of data cleaning rely on the definition of rules: based on the result of the evaluation of a condition, a specific operation is performed [5].

3 SCIEXPEM
In many experimental disciplines, data is collected from different sources such as repositories, the literature, or private communication between research laboratories. This entails having to manage various problems related to the heterogeneity of the data [11]. Furthermore, as in combustion kinetics, there is no uniquely accepted representation standard to convey this information. All this implies, even for the most recent data, different accuracy, completeness, and other data quality dimensions of the repository [23].

Experimental data are precious both for their rarity and for the cost of collecting them. For this reason, it is essential to accept all the experiments and then carry out a series of automatic checks to preserve the repository's quality. For example, a possible control is on the consistency between the unit of measurement and the measured property. Another quality dimension to guarantee is completeness: since the data come from different sources, times, and formats, it is essential to ensure that all the primary information of an experiment, in terms of metadata, is complete. Regarding the semantic accuracy of the experimental data, it is important that the values of the properties are within a range of reasonableness. However, while in the literature there has been extensive attention to developing techniques for managing and ensuring data quality and consistency (see [3] for an extensive survey), there are still many open problems in understanding the quality of data in their context of use. In particular, in this paper we focus on using experimental data in simulation model development in general, in a context in which the experimental error can be notoriously significant but is not (or cannot) easily be quantified. In this context, the problem is the ability to identify possible errors in the data and/or in the models, in a joint validation effort based on a data-driven approach. Finally, a crucial aspect for all data-driven applications is automation: in the case of predictive model development, manually managing the simulations and validations of the experiments is a wasteful and error-prone task. The problem is to provide a generic framework that manages experiments easily and in a domain-independent way, associating them with the information needed for data-driven techniques, such as simulations and predictions.

To tackle these problems, we define the process illustrated in Figure 2, which follows the entire life cycle of experimental data to guarantee a certain level of data quality, according to different quality dimensions, and at the same time provides information to improve the predictive model. This human-in-the-loop process is implemented within SciExpeM (Scientific Experiments and Models), a framework that offers different services related to the management and analysis of experimental and simulated data to speed up the predictive model development process in combustion kinetics [19, 22]. We associate with the activities in the process additional metadata to assess the validation state of the experimental data, status, which denotes whether an experiment is new in the database or whether it is invalid or verified.

SciExpeM uses the process for different applications. First of all, the user enters the experimental data into the system using, for example, an interactive form. The experiment is added to the database, and SciExpeM checks for syntactic or detectable semantic errors. Initially, the new data are tagged as new, and they can be set to invalid in any of the following phases if flaws in the data are detected. In a second moment (activity Check experiment in Figure 2), an expert has to verify each new experiment, checking for undetectable semantic errors and filling in the incomplete experiment metadata. Once an experiment is verified, the status field changes accordingly, and SciExpeM couples the experiment to an interpreter. Experimental data and results of simulators are records of information that we need to distinguish and pair automatically. To this purpose, we propose to associate experimental data with the concept of an interpreter for the data. This entity, in particular, can recognize the properties that are under investigation in an experiment from the others that are just auxiliary information, such as environmental conditions. For example, in Figure 1, the pressure is neither the dependent nor the independent variable (or property) under investigation, unlike temperature and concentration. Moreover, based on the experiment details, the interpreter knows which solver needs to be used to simulate it and how to correctly pair the experimental data with the corresponding simulated ones. Finally, when the system can manage an experiment independently with its simulations, a loop starts. The simulated data are compared with the experimental data using a similarity index that provides information to improve both the model and the repository quality. This comparison is possible because we leverage a bidirectional relationship: we use the model to validate the data and the data to validate the model. First, using the experimental data to validate the model helps in understanding which aspects or portions of the domain represented by an experiment still need to be improved. Second, we use the model to understand whether the semantics of the experimental data is reliable: a model that differs strongly from the experimental data signals either an error in the model or incorrect experimental data. The human-in-the-loop approach allows assessing these discrepancies and taking the appropriate actions.

[Figure 2 steps: Add Experiment → Check Experiment → Verify Experiment → Interpret Experiment → Simulate Experiment → Analyze Simulation → Improve Model and/or Data Cleaning.]

Figure 2: A simplified schema of the experimental data process.
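The experiment status life cycle just described (every accepted experiment starts as new, an expert promotes it to verified, and invalid is reachable from any phase) can be sketched as a small state machine. The class and method names below are illustrative, not SciExpeM's actual API.

```python
from enum import Enum


class Status(Enum):
    NEW = "new"            # just inserted; only automatic checks passed
    VERIFIED = "verified"  # checked by a domain expert, metadata completed
    INVALID = "invalid"    # flaws detected; kept to prevent re-insertion


class Experiment:
    """Minimal sketch of the status metadata attached to each experiment."""

    def __init__(self, metadata):
        self.metadata = metadata
        self.status = Status.NEW  # every accepted experiment starts as 'new'

    def verify(self):
        """An expert confirmed metadata and semantic correctness."""
        if self.status is Status.NEW:
            self.status = Status.VERIFIED

    def invalidate(self):
        """Flaws can be detected in any phase of the process."""
        self.status = Status.INVALID
```

Keeping invalid entries in the repository, rather than deleting them, is what prevents the same flawed experiment from being re-entered later.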
4 EXPERIMENTS MANAGEMENT
Representing, collecting, and integrating heterogeneous data in a database are only the initial steps to extract value. In Section 4.1, we present our approach to interpreting the semantics of the data correctly; in Section 4.2, we measure the coverage of a database in a given domain; in Section 4.3, we focus on improving the repository quality.

4.1 Rule-based Automatic Interpretation
Experiments are records of measured properties and other metadata that characterize them. Besides, among the measurements, it is not rare to find additional measured properties that specify, for example, the environmental conditions of the measures without being the subject of the scientific observation. This peculiarity generates ambiguity, since a property could be the subject in one experiment but not in another. In practice, to manage scientific data, there is the need to distinguish automatically which, among the measured properties, are the dependent and the independent variables.
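The ambiguity can be made concrete with two toy records, mirroring Figure 1; the records and property names are invented for the example.

```python
# The same property plays different roles in different experiments:
# its role is determined per experiment, not globally.

experiment_a = {
    "independent": "temperature",   # property varied on the x-axis
    "dependent": "concentration",   # measured response on the y-axis
    "auxiliary": ["pressure"],      # recorded, but not under study
}

experiment_b = {
    "independent": "time",
    "dependent": "pressure",        # here pressure IS the subject
    "auxiliary": ["temperature"],   # and temperature is just a condition
}

# A per-experiment role table, rather than a global one, is therefore needed.
roles = {
    ("a", "pressure"): "auxiliary",
    ("b", "pressure"): "dependent",
}
```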
In this context, we need to teach the data management system to recognize the role of each property in each experiment, keeping in mind that what makes a property a subject of an experiment is a particular combination of metadata values of the experiment itself. For this reason, it is necessary to define a flexible methodology to distinguish the subject properties from the auxiliary ones. In other words, we need to find an approach to transfer the domain knowledge into SciExpeM so as to interpret the semantics of an experiment correctly and treat all the database entries with equal semantics in the same way.

Manual management of this complex database is not feasible, because an experiment could contain dozens of measured properties and, for example, we would have to tag each of them correctly as subject of the experiment or not. Moreover, this procedure would have to be repeated hundreds of times, once for each experiment, making it hard to analyze a large amount of data.
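The automatic tagging needed here can be mechanized as a rule-matching step: an interpreter is assigned to an experiment only if the experiment fulfills all of the interpreter's rules. The sketch below simplifies the rules to (attribute, value) pairs on a flat metadata dictionary; the field names and interpreter ids are illustrative, not SciExpeM's schema.

```python
# Sketch of rule-based interpreter assignment.

def fulfills(experiment: dict, rule: tuple) -> bool:
    """A rule (attribute, value) is fulfilled if the experiment has that
    attribute and its value matches the rule's value."""
    attribute, value = rule
    return experiment.get(attribute) == value


def assign_interpreter(experiment: dict, interpreters: dict):
    """Return the id of the first interpreter whose rules are ALL fulfilled,
    or None if no interpreter matches (the experiment then needs a human)."""
    for interpreter_id, rules in interpreters.items():
        if all(fulfills(experiment, rule) for rule in rules):
            return interpreter_id
    return None


interpreters = {
    50: [("Reactor", "PFR"), ("Exp. Type", "IDT")],
    60: [("Reactor", "RCM")],
}
exp = {"Reactor": "RCM", "Exp. Type": "IDT"}
print(assign_interpreter(exp, interpreters))  # → 60
```

Evaluating a handful of declarative rules per experiment replaces the manual tagging of dozens of properties across hundreds of experiments.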
Accordingly, we propose a methodology that automatically extracts useful information from a database model in which semantic heterogeneity is present. We propose a dynamic interpretation of a database model based on rules, similar to what is done for data cleaning or to ensure consistency and accuracy in a database [5]. Figure 3 shows the class schema of the database that we use to implement the automatic interpretation of scientific experiments. Given a model Experiment (Exp.), E, that is an abstract representation of a model affected by ambiguity, we have to assign, for each entry e ∈ E, an Interpreter entry of the model I. This model can save additional meta-information that could be useful for other tasks. For example, in this schema, the interpreter knows which precise solver we need to use to simulate an experiment. Each interpreter knows how to distinguish the primary data from the secondary information and how to correctly map them. This is possible because the interpreter has multiple references M = {m_1, ..., m_n} to a mapping model M that knows, for example, the correct dependent–independent variable relation or, more in general, can separate the useful information from the secondary one and, if necessary, pair them. In order to associate an interpreter with an entry of the model E, we associate a set of rules, R = {r_1, ..., r_k}, with the interpreter. These rules r are entries of another table in the database, rule, R, where each element specifies a model name N, an attribute's name A, and a value V. A rule r ∈ R is fulfilled by an entry e ∈ E if A is an attribute of e and the corresponding value of the attribute is equal to V. The model name N is an optional field that, if defined, specifies that the rule is not directly on an attribute of the model e but is related to an attribute of another model N that has a reference to the entry e. If an entry e fulfills all the rules r associated with an interpreter i ∈ I, we can associate the interpreter i with the entry e.

[Figure 3 schema: DATA (EXP ID, DATA NAME, VALUE), EXPERIMENT (EXP ID, INTER. ID), METADATA (NAME, VALUE, EXP ID), INTERPRETER (INTER. ID, SOLVER), RULE (MODEL, NAME, VALUE, INTER. ID), MAPPING (INTER. ID, TYPE, DATA TYPE, NAME).]

Figure 3: The class model used to represent the domain knowledge and correctly interpret the semantics of the experiments.

4.2 Database Coverage
The Model Validation procedure systematically measures how good the predictions of a model are compared to the corresponding experimental data. For the result of this procedure to be reliable, the experimental database should, if possible, cover the domain as much as possible and with equal granularity. Database coverage can help in this task, providing an immediate procedure to measure the diversity and completeness of representation of the database.

We leverage categorical attributes and a multidimensional matrix to represent the domain and to define a coverage index. This approach overcomes the limitations of using patterns and thresholds, which are sensitive and directly affect the measurements based on the way they are defined. We create a detailed and generic representation of the database coverage that can be used to assess which part of the domain is poorly covered by data and, consequently, to start a Design of Experiments process.

We measure the coverage C of a dataset D that regards the model M with n attributes, A = {A_1, ..., A_n}, in three steps.

First, it is necessary to identify a subset of the model fields (or attributes) {A_1, ..., A_s} = Â ⊆ A and transform them into categorical attributes. A categorical attribute of the model is a field that can only take a value from a restricted number of options. In this way, any attribute A_i ∈ Â can only have d_Ai different ordered categorical values (or possible options). If the attribute A_i ∈ Â is a continuous numeric field, we take the minimum (min) and the maximum (max) value that can be taken by A_i in the domain, fix t equidistant ticks in the range [min, max], and associate the value of the attribute with the closest tick. If, instead, the possible values of an attribute are not continuous but have high cardinality, we can identify a subset of the possible values leveraging a hierarchy among them or using bucketization: similar values are associated with the same bucket [2]. Given an entry r of the model M and an attribute A_i ∈ Â, r has a corresponding vector v_{Ai,r} = (v_{Ai,r}[1], ..., v_{Ai,r}[d_Ai]) for the attribute A_i, where v_{Ai,r}[k] = 1, with k ∈ [1, d_Ai], if r has the k-th categorical value for the attribute A_i, and 0 otherwise. In this way, it is possible to register an array field of the model where an entry can assume multiple categorical values for the same attribute.

Second, we define a multidimensional space that reflects our database's coverage among the chosen attributes Â, with cardinality |Â| = s. Each characteristic A_i ∈ Â defines a dimension of the space of size d_Ai.
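The discretization and coverage bookkeeping described in this section can be sketched as follows. The attribute names, ranges, tick counts, and threshold are illustrative choices, and a sparse dictionary stands in for the multidimensional matrix CM; this is a sketch of the idea, not the system's implementation.

```python
# Sketch of the coverage computation for continuous attributes discretized
# to t equidistant ticks (step 1), collected into a coverage matrix (step 2)
# and reduced to a normalized index by thresholding the cells (step 3).

def ticks(lo, hi, t):
    """t equidistant ticks over the range [lo, hi]."""
    step = (hi - lo) / (t - 1)
    return [lo + k * step for k in range(t)]


def closest_tick_index(value, tick_list):
    """Index of the tick closest to the given attribute value."""
    return min(range(len(tick_list)), key=lambda k: abs(tick_list[k] - value))


def coverage(entries, axes, threshold):
    """entries: list of dicts; axes: {attribute: tick_list}.
    Returns (cm, index): cm maps a cell (one tick index per attribute) to the
    number of entries falling in it; the index is the fraction of cells whose
    count reaches the threshold, over the total number of cells."""
    cm = {}
    for entry in entries:
        cell = tuple(closest_tick_index(entry[a], axes[a]) for a in axes)
        cm[cell] = cm.get(cell, 0) + 1
    total_cells = 1
    for tick_list in axes.values():
        total_cells *= len(tick_list)
    covered = sum(1 for count in cm.values() if count >= threshold)
    return cm, covered / total_cells


# 500-2000 K in steps of 25, 0-40 bar in steps of 10, as in Section 5.
axes = {"temperature": ticks(500, 2000, 61), "pressure": ticks(0, 40, 5)}
exps = [{"temperature": 812, "pressure": 9},
        {"temperature": 820, "pressure": 11}]
cm, c = coverage(exps, axes, threshold=1)
```

A dense multidimensional array would work equally well; the sparse dictionary simply avoids materializing the many empty cells of a poorly covered domain.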
We then create a matrix, called coverage matrix CM, with dimension d_CM = d_A1 × ... × d_As to represent this space.

Finally, after initializing all the matrix cells to 0, for every entry r in the model M and for every possible combination of categorical values of the attributes, we update the coverage matrix using Equation (1) only if the condition in Equation (2) holds for r, with i_m ≠ 0 for m ∈ [1, s]:

CM[i_1, ..., i_s] += 1    (1)

v_{A1,r}[i_1] = ... = v_{As,r}[i_s] = 1    (2)

The final result is a density matrix that represents the coverage of our database with respect to the given categorical attributes. We can then immediately define a database coverage index: after examining all the entries r present in the dataset D, we count the number of cells with a value bigger than a given threshold T and normalize this count by the total number of cells (Equation (3)):

C = ( Σ_{i ∈ [1, d_A1], ..., k ∈ [1, d_As]} 1 if CM[i, ..., k] ≥ T ) / d_CM ∈ [0, 1]    (3)

[Figure 4 example: experiment 1 has metadata Reactor = PFR and Exp. Type = IDT; experiment 2 has Reactor = RCM and IDT Type = d/dt OH. Interpreter 50 has rules (META, Reactor, PFR) and (META, Exp. Type, IDT); interpreter 60 has rules (META, Reactor, RCM) and (META, IDT Type, d/dt OH). The mappings assign, e.g., pressure to the X-axis and temperature to the Y-axis for interpreter 50, and temperature to the X-axis and IDT to the Y-axis for interpreter 60.]

Figure 4: An example of the rule-based interpretation.

4.3 Data cleaning
Data-driven applications are sensitive to data quality, but in domains where the experimental data are rare and affected by non-negligible uncertainty, it is hard to define and measure the quality level of an experiment on which to base the decision to accept or reject its insertion into the repository. As discussed in Section 3, the process that we have identified tries to mitigate three different data quality dimensions: consistency, completeness, and accuracy. The domain-specific automatic checks, for example, ensure consistency, examining that the unit of measurement of a property is valid. In the verification step, instead, the scientist completes the empty mandatory metadata of the experiment. The accuracy of experimental data affected by uncertainty is hard to quantify, but the combination of a human-in-the-loop, the predictive model, and a similarity index can help in this task. The predictive model has its own uncertainty; for this reason, if we use a similarity index that quantifies the difference between the predicted data and the experimental data, we can automatically identify an experiment whose behavior is somewhat different from that of the other similar experiments. It is then the scientist who establishes, case by case, what happened, invalidating the experiment, if necessary, through the status metadata. Once an iteration of the simulation-analysis-cleaning loop is terminated, the cycle can start over, and the attention moves to another experiment. Section 5 presents examples of data cleaning, database coverage, and semantic interpretation.

5 DISCUSSION
The backbone of automation in a scientific data management system is the ability to understand the semantics of an experiment. In our case study, this means distinguishing the x-axis from the y-axes and correctly pairing the experimental properties with the simulated data. Figure 4 shows an example of the assignment of an Interpreter to two experiments based on rules. The interpreter with ID 50 is assigned to the experiment with ID 1: in fact, all the rules specified by this interpreter are fulfilled by the experiment. Then, thanks to the interpreter, we are able to recognize the x-axis and the y-axis of the experimental data.

SciExpeM has a database of about 500 experiments which, as described in Section 4.2, have been categorized based on two metadata attributes suggested by domain experts: temperature and pressure. Specifically, the temperature is tokenized from a minimum of 500 K to a maximum of 2000 K in steps of 25 degrees, while the pressure goes from 0 to 40 bar in steps of 10. The coverage index using threshold 1 is 0.88; if 3 and 5 are the thresholds, the coverage index is 0.55 and 0.32, respectively. Figure 5 shows the density of the coverage matrix CM, which is used to calculate these indices.

Figure 5: A heatmap representing the density of the coverage matrix.

Figure 6: Heatmap visualization of the outlier detection inside the human-in-the-loop process. On the y-axis, different models; on the x-axis, different experiments. The heatmap value depicts the Curve Matching score. (a) Before an iteration of the analysis-improvement loop; in red, a possible outlier. In this case, the data was evaluated by an expert as unreliable. (b) After excluding the unreliable data from the database, we re-analyze the same set of data, highlighting other possible sources of information/errors.

Through the interpreter, SciExpeM can simulate an experiment with different models and compare the results. With model validation, given a domain-specific similarity measure, we measure the predictive performance of a model against a set of experimental data. The analysis of the similarity scores after the model validation provides essential information for the model improvement and can also be used to improve the quality of the repository itself. In fact, we can also use the predictive model capabilities to perform data cleaning. A rule-based approach for data cleaning is already implemented, focused on syntactic or semantic rules on the attributes of the database model, but it is not powerful enough to understand whether the measurements contained inside the experimental data are reliable. We combine the use of the predictive model with automatic statistical investigation tools to detect outliers [12].
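A minimal sketch of this statistical screening, assuming scores from a similarity index such as Curve Matching grouped by category; the two-sided 1.5-standard-deviation threshold is an illustrative choice, not the system's calibrated rule.

```python
# Sketch: within a category of experiments (same portion of the domain), a
# similarity score far from the category mean flags a possible outlier for
# further human inspection.

from statistics import mean, stdev


def flag_outliers(scores, n_sigma=1.5):
    """scores: {experiment_id: similarity score} for one category.
    Returns the ids whose score deviates from the category mean by more
    than n_sigma sample standard deviations."""
    values = list(scores.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all scores identical: nothing stands out
    return [eid for eid, s in scores.items() if abs(s - mu) > n_sigma * sigma]


category_scores = {"exp1": 0.91, "exp2": 0.88, "exp3": 0.90,
                   "exp4": 0.89, "exp5": 0.92, "exp6": 0.20}
print(flag_outliers(category_scores))  # → ['exp6']
```

The flagged experiments are only tagged for inspection, never silently removed: deciding between a model error and unreliable data remains a human task.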
For this task, we leverage the categorization of experiments described in Section 3: it is reasonable to expect that the prediction performance of a model over a set of data belonging to the same category, i.e., the same portion of the domain, is similar. A significant deviation from the average similarity index of a simulation therefore signals a possible outlier. As explained in Section 3, each entry of the database has metadata that specify its status. If an entry is a possible outlier, we automatically tag it with a specific label in its status that alerts the human-in-the-loop that further inspection is required. This procedure verifies whether the model is wrong, providing clues for model improvement, or whether the experimental data are unreliable. In the latter case, the entry status is changed to a specific value, invalid, which excludes the entry from further analysis; the experiment nevertheless remains in the repository to prevent it from being re-entered in the future.

In our case study, we use Curve Matching [4], a similarity index between two curves: one represents the experimental data, the other the data predicted by the model. In Figure 6, it is possible to observe one iteration of the continuous analysis-improvement loop in which both the model and the experimental database can be improved. In this specific case, an unreliable experiment is identified (Figure 6a) and then excluded from the following iterations (Figure 6b) after a deeper analysis by the scientist. In Figure 6b, the heatmap color is rescaled accordingly, to depict that the attention will be on different experiments in the next iteration.

6 CONCLUDING REMARKS

In this work, we have presented the problems and proposed solutions for managing a complex database that represents an experimental domain. As in many cases, creating a scientific repository is not the final goal, but a preliminary step to extract value from the data. For this purpose, we have created a human-in-the-loop process in which the users have different tasks. First, to resolve the heterogeneity of the data, using general metadata as additional model attributes and with the help of the users, we can categorize and distinguish which portion of the domain is precisely represented by the experiments. Second, using a rule-based procedure, we can automatically understand the semantics of experimental data. This information is essential for the subsequent automatic analyses. Finally, as in many validation scenarios, the reliability of the prediction accuracy depends on the coverage of the test set. For this purpose, we develop a general coverage index that, given a set of model attributes that define the domain space, quantifies the domain coverage of the database. Besides, we can combine this information with a statistical investigation: given a similarity measure and human support, we can establish whether an outlier experiment is a source of information for the improvement of the predictive model or unreliable experimental data, thus improving the overall database quality.

ACKNOWLEDGMENTS

The work of E.R. is supported by the interdisciplinarity PhD project of Politecnico di Milano.

REFERENCES

[1] Chris Allan et al. 2012. OMERO: flexible, model-driven data management for experimental biology. Nature Methods 9, 3 (2012), 245–253.
[2] Abolfazl Asudeh, Zhongjun Jin, and HV Jagadish. 2019. Assessing and remedying coverage for a given dataset. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 554–565.
[3] Carlo Batini and Monica Scannapieco. 2016. Data and Information Quality - Dimensions, Principles and Techniques. Springer. https://doi.org/10.1007/978-3-319-24106-7
[4] Mara Sabina Bernardi, Matteo Pelucchi, Alessandro Stagni, Laura Maria Sangalli, Alberto Cuoci, Alessio Frassoldati, Piercesare Secchi, and Tiziano Faravelli. 2016. Curve matching, a generalized framework for models/experiments comparison: An application to n-heptane combustion kinetic mechanisms. Combustion and Flame 168 (2016), 186–203.
[5] Louardi Bradji and Mahmoud Boufaida. 2011. A rule management system for knowledge based data cleaning. Intelligent Information Management 3, 6 (2011).
[6] Kamaljit Chowdhary and Paul Dupuis. 2013. Distinguishing and integrating aleatoric and epistemic variation in uncertainty quantification. ESAIM: Mathematical Modelling and Numerical Analysis 47, 3 (2013), 635–662.
[7] Marina Drosou, HV Jagadish, Evaggelia Pitoura, and Julia Stoyanovich. 2017. Diversity in big data: A review. Big Data 5, 2 (2017), 73–84.
[8] Sandra Geisler, Maria-Esther Vidal, Cinzia Cappiello, Bernadette Farias Lóscio, Avigdor Gal, Matthias Jarke, Maurizio Lenzerini, Paolo Missier, Boris Otto, Elda Paja, Barbara Pernici, and Jakob Rehof. 2021. Knowledge-driven Data Ecosystems Towards Data Transparency. arXiv:2105.09312 [cs.DB]
[9] Zhiqiang Gong, Ping Zhong, and Weidong Hu. 2019. Diversity in machine learning. IEEE Access 7 (2019), 64323–64350.
[10] Jane Greenberg, Hollie C White, Sarah Carrier, and Ryan Scherle. 2009. A metadata best practice for a scientific data repository. Journal of Library Metadata 9, 3-4 (2009), 194–212.
[11] Francesco Guerra, Paolo Sottovia, Matteo Paganelli, and Maurizio Vincini. 2019. Big data integration of heterogeneous data sources: the re-search alps case study. In 2019 IEEE International Congress on Big Data (BigDataCongress). IEEE, 106–110.
[12] Victoria Hodge and Jim Austin. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22, 2 (2004), 85–126.
[13] Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J Franklin, and Ken Goldberg. 2016. ActiveClean: Interactive data cleaning for statistical modeling. Proceedings of the VLDB Endowment 9, 12 (2016).
[14] Victor R Lambert and Richard H West. 2015. Identification, correction, and comparison of detailed kinetic models. In 9th US Natl Combust Meeting, Cincinnati, OH.
[15] Yin Lin, Yifan Guan, Abolfazl Asudeh, and HV Jagadish. 2020. Identifying insufficient data coverage in databases with multiple relations. Proceedings of the VLDB Endowment 13, 12 (2020), 2229–2242.
[16] Luigi Marini et al. 2018. Clowder: Open Source Data Management for Long Tail Data. In Proceedings of the Practice and Experience on Advanced Research Computing (Pittsburgh, PA, USA) (PEARC '18). Association for Computing Machinery.
[17] Carsten Olm, István Gy Zsély, Róbert Pálvölgyi, Tamás Varga, Tibor Nagy, Henry J Curran, and Tamás Turányi. 2014. Comparison of the performance of several recent hydrogen combustion mechanisms. Combustion and Flame 161, 9 (2014), 2219–2234.
[18] Barbara Pernici, Francesca Ratti, and Gabriele Scalia. 2021. About the Quality of Data and Services in Natural Sciences. Springer International Publishing, Cham, 236–248. https://doi.org/10.1007/978-3-030-73203-5_18
[19] Edoardo Ramalli, Gabriele Scalia, Barbara Pernici, Alessandro Stagni, Alberto Cuoci, and Tiziano Faravelli. 2021. Data ecosystems for scientific experiments: managing combustion experiments and simulation analyses in chemical engineering. Accepted for publication in Frontiers in Big Data (2021).
[20] Yuji Roh, Geon Heo, and Steven Euijong Whang. 2021. A survey on data collection for machine learning: a Big Data-AI integration perspective. IEEE Transactions on Knowledge and Data Engineering 33 (2021), 1328–1347.
[21] Gabriele Scalia, Colin A Grambow, Barbara Pernici, Yi-Pei Li, and William H Green. 2020. Evaluating scalable uncertainty estimation methods for deep learning-based molecular property prediction. Journal of Chemical Information and Modeling 60, 6 (2020), 2697–2717.
[22] Gabriele Scalia, Matteo Pelucchi, Alessandro Stagni, Alberto Cuoci, Tiziano Faravelli, and Barbara Pernici. 2019. Towards a scientific data framework to support scientific model development. Data Science 2, 1-2 (2019), 245–273.
[23] Fatimah Sidi, Payam Hassany Shariat Panahy, Lilly Suriani Affendey, Marzanah A Jabar, Hamidah Ibrahim, and Aida Mustapha. 2012. Data quality: A survey of data quality dimensions. In 2012 International Conference on Information Retrieval & Knowledge Management. IEEE, 300–304.
[24] Carol Tenopir, Elizabeth D Dalton, Suzie Allard, Mike Frame, Ivanka Pjesivac, Ben Birch, Danielle Pollock, and Kristina Dorsett. 2015. Changes in data sharing and data reuse practices and perceptions among scientists worldwide. PloS One 10, 8 (2015).
[25] Tamás Varga, T Turányi, E Czinki, T Furtenbacher, and A Császár. 2015. ReSpecTh: a joint reaction kinetics, spectroscopy, and thermochemistry information system. In Proceedings of the 7th European Combustion Meeting, Vol. 30. 1–5.
[26] Bryan W Weber and Kyle E Niemeyer. 2018. ChemKED: A Human- and Machine-Readable Data Standard for Chemical Kinetics Experiments. International Journal of Chemical Kinetics 50, 3 (2018), 135–148.
[27] Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 1 (2016), 1–9.