<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dataversifying Natural Sciences: Pioneering a Data Lake Architecture for Curated Data-Centric Experiments in Life &amp; Earth Sciences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Genoveva Vargas-Solar</string-name>
          <email>genoveva.vargas-solar@cnrs.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jérôme Darmont</string-name>
          <email>jerome.darmont@univ-lyon2.fr</email>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Adorjan</string-name>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Javier A. Espinosa-Oviedo</string-name>
          <email>javier.espinosa@liris.cnrs.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmem Hara</string-name>
          <email>carmemhara@ufpr.br</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabine Loudcher</string-name>
          <email>sabine.loudcher@univ-lyon2.fr</email>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Regina Motz</string-name>
          <email>rmotz@fing.edu.uy</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Musicante</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José-Luis Zechinelli-Martini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNRS</institution>
          ,
          <addr-line>Univ. Lyon, INSA Lyon, UCBL, LIRIS, UMR5205, F-69221</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CPE Lyon</institution>
          ,
          <addr-line>43 Blvd. du 11 Novembre 1918, 69616 Villeurbanne Cedex</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Fundación Universidad de las Américas, Puebla</institution>
          ,
          <addr-line>Exhacienda Sta. Catarina Mártir s/n, 72820 San Andrés Cholula</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Instituto de Computación (INCO), Facultad de Ingeniería, Universidad de la República</institution>
          ,
          <country country="UY">Uruguay</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Universidade Federal do Rio Grande do Norte, DIMAP</institution>
          ,
          <addr-line>Natal</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Universidade Federal do Paraná, Dept. de Informática</institution>
          ,
          <addr-line>Curitiba - PR, 81531-980</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Université de Lyon</institution>
          ,
          <addr-line>Lyon 2, UR ERIC 5 avenue Mendès France, 69676 Bron Cedex</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff7">
          <label>7</label>
          <institution>Universidad ORT</institution>
          ,
          <addr-line>Montevideo</addr-line>
          ,
          <country country="UY">Uruguay</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This vision paper introduces a pioneering data lake architecture designed to meet Life &amp; Earth sciences' burgeoning data management needs. As the data landscape evolves, the imperative to navigate and maximise scientific opportunities has never been greater. Our vision paper outlines a strategic approach to unify and integrate diverse datasets, aiming to cultivate a collaborative space conducive to scientific discovery. The core of the design and construction of a data lake is the development of formal and semi-automatic tools, enabling the meticulous curation of quantitative and qualitative data from experiments. Our unique "research-in-the-loop" methodology ensures that scientists across various disciplines are integrally involved in the curation process, combining automated, mathematical, and manual tasks to address complex problems, from seismic detection to biodiversity studies. By fostering reproducibility and applicability of research, our approach enhances the integrity and impact of scientific experiments. This initiative is set to improve data management practices, strengthening the capacity of Life &amp; Earth sciences to solve some of our time's most critical environmental and biological challenges.</p>
      </abstract>
      <kwd-group>
        <kwd>Life and Earth sciences</kwd>
        <kwd>data-driven experiments</kwd>
        <kwd>data lake</kwd>
        <kwd>data curation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>These days, it is relatively easy and inexpensive to acquire massive amounts of data, even in continuous mode.</p>
        <p>This has been no different for experimental and
observational sciences like Life &amp; Earth sciences. Access
to data about the Earth and its biodiversity, with varying
levels of provenance, quality and reliability, opens up the
possibility of constructing different perspectives on the
phenomena observed, leading to scientific conclusions
of different depths that target a wide range of
knowledge consumers (civilians, decision-makers, scientists).</p>
        <p>Traditional schema-on-write approaches, such as the
Extraction, Transformation and Loading (ETL) process,
are ineffective for the data management requirements of
these experimental sciences. Data lakes are becoming
increasingly common for the management and analysis
of massive data. Data lakes are repositories that store raw
data in its original format. They can be well adapted for
storing data harvested from digital sources (observation
stations), social media, Web and in situ collectors.</p>
        <p>The extraction of value through data-driven
experiments in the Life &amp; Earth sciences is determined by two
main elements:
• The maintenance of metadata gathering the
conditions under which experiments are performed
(quantitative perspective) to preserve the
memory of the experimental knowledge
production process, and to enable understanding
and reproducibility.
• An open science perspective that goes beyond
data sharing and also considers the sharing of
know-how, decision-making, expertise, project
management, and the people within the projects that
define the research.</p>
        <p>This vision paper introduces our approach to designing
and building a data lake for collecting and integrating
data and metadata of Life &amp; Earth sciences’ data-driven
experiments.</p>
        <p>The remainder of the paper is organised as follows.
Section 2 gives a general overview of approaches that
address curating and managing knowledge in Life &amp; Earth
sciences. Section 3 describes the challenges associated
with curating data and data-driven experiments in Life &amp;
Earth sciences, often guided by researchers. In particular,
the section gives the general challenges for building data
lakes containing curated data and producing knowledge
derived from data-driven experiments. Section 4
introduces the general principle for building, maintaining and
exploiting a data lake. The data lake allows the creation
of "dataverses" that can export the history of the
development of experimental processes that lead to knowledge
in Life &amp; Earth sciences. Finally, Section 5 concludes the
paper and discusses future work.</p>
        <p>2. Related work</p>
        <p>We introduce the main topics and approaches that
underlie the vision of maintaining and sharing data to
perform data-driven experiments: data harvesting tools,
data curation techniques, data labs, data lakes, science
lakes and dataverses.</p>
        <p>2.1. Data harvesting</p>
        <p>Data available on the Web play a determining role in
decision-making in personal and corporate life. Collecting
and storing these data in a structured model helps
integrate them with other sources and use the dataset in
various applications, such as event detection and sentiment
monitoring. Online newspapers are essential sources of
information, accessed daily by thousands of people.</p>
        <p>Various works in the literature report manual efforts to
extract data from pages on the Web [1, 2]. However, these
efforts have been eased by applying Web scraping
techniques. Some work complements automated extraction
processes to obtain clean and analysed data by
implementing curation procedures [3]. Among the various
existing tools available on the Web for data extraction,
we can highlight the following. ParseHub<sup>1</sup> is a web scraping tool that
facilitates data extraction from websites through an
interactive click-based interface, saving the data directly to
the cloud in JSON and CSV formats. It navigates through
continuation pages and captures complete news articles,
with the ability to collect data based on specific character
sequences. 80legs<sup>2</sup> offers sequential data extraction from
websites. Octoparse<sup>3</sup> simplifies the data extraction
process by enabling users to create a scraping workflow with
clicks. It includes features like URL and string lists for
targeted scraping and ready-to-use templates for popular
sites like Amazon and Google. FactExtract [3] is tailored
for aggregating content from specific Senegalese news
sources, boasting automatic language detection for ten
languages, data cleaning, and analysis, all whilst
avoiding data duplication. This tool, which utilises Python’s
Newspaper library, also features automated daily updates
for the news content it monitors. ENoW - News Data
Extractor from the Web<sup>4</sup> is a news scraping system that
explores online newspapers. ENoW receives search strings
as input and stores, in a relational database, data extracted
from the news and their full content.</p>
        <p>2.2. Data curation</p>
        <p>According to Garcov et al. [4], research data curation
is “preparing research data and artefacts for sharing and
long-term preservation”. Research repositories are the
standard for publishing data collections to the research
communities. Datasets at an early collection stage are
generally not ready for analysis or preservation. Thus,
extensive preprocessing, cleaning, transformation, and
documentation actions are required to support usability,
sharing, and preservation over time [5]. Curated data
collections have the potential to drive scientific progress
[6], are relevant for reproducibility and improve the
reliability of sciences [7]. However, data curation introduces
challenges for supporting data-driven applications [8]
adopting quanti-qualitative methods. For example,
research challenges include curating material across time, space
and collaborators [7]. Quantitative and qualitative
research methodologies apply ad hoc data curation
strategies that keep track of the data that describe the tools,
techniques, hypotheses, and data harvesting criteria
defined a priori by a scientific team.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2 https://80legs.com/</title>
        <p>3 https://www.octoparse.com/
4 L. Reips, M. Musicante, G. Vargas-Solar, A.T.R. Pozo, C.S. Hara,
ENoW - Extrator de Dados de Notícias da Web, Demonstration, Anais
Estendidos do XXXVIII Simpósio Brasileiro de Bancos de Dados, 2023,
78-83</p>
      </sec>
      <sec id="sec-1-3">
        <title>1 https://www.parsehub.com/</title>
        <p>Several software tools that apply statistical techniques
and machine learning algorithms are available for
qualitative researchers. Woods et al. [9] argue that
Computer-Assisted Qualitative Data Analysis Software (CAQDAS)
is a well-known tool for qualitative research. These tools
support qualitative techniques and methods for
applying Qualitative Data Analysis (QDA). ATLAS.ti [10],
Dedoose [11], MAXQDA [12] and NVivo [13] implement the
REFI-QDA standard, an interoperability exchange
format. With CAQDAS [14], researchers and practitioners can
perform annotation, labelling, querying, audio and video
transcription, pattern discovery, and report generation.</p>
        <p>Furthermore, CAQDAS tools allow the creation of field
notes, thematic coding, search for connections, memos
(thoughtful comments), contextual analysis, frequency
analysis, word location and data analysis presentation
in different reporting formats [15]. The REFI-QDA
(Rotterdam Exchange Format Initiative)<sup>5</sup> standard allows
the exchange of qualitative data to enable reuse in QDAS
[16]. QDA software such as ATLAS.ti [10], Dedoose [11],
MAXQDA [12], NVivo [13], QDAMiner [17], Quirkos
[18] and Transana [19] adopt the REFI-QDA standard.</p>
        <p>We assume that data curation consists of identifying,
systematizing, managing, and versioning research data,
considering versioning artefacts an essential component
of tracking changes along the research project.</p>
        <p>2.3. Data labs</p>
        <p>Data science environments provide data labs like
Kaggle<sup>6</sup> and Dryad<sup>7</sup> with stacks of services for (externalised)
data storage, tagging and exploring tools. These
environments offer a collective space for sharing highly curated
data collections and maintenance tools. There are specialised
repositories like DataONE<sup>8</sup> and registries of data
repositories like re3data<sup>9</sup>.</p>
        <p>DataONE (Data Observation Network for Earth) is a
community-driven project that provides access to various
environmental and ecological data across multiple
member repositories. It is designed as an innovative
framework aimed at facilitating research and enabling
scientists and researchers to preserve, access, use, and increase
the impact of their data. The platform provides robust
data management tools, ensuring datasets’ preservation
and integrity. DataONE underscores data stewardship
as a federated resource and supports scientific
collaboration and reproducibility. It is invaluable for researchers
seeking to address complex environmental challenges
through shared data and knowledge.</p>
        <p>Re3data is a global registry of research data
repositories that offers a comprehensive directory for researchers
seeking to access, store, share, and manage their datasets.
It represents a variety of academic disciplines and
provides detailed information about each repository, such
as access policies, standards, and contact details. re3data
promotes data sharing, visibility, and reuse as a critical
reference point for finding suitable repositories for data
deposition. The platform enhances transparency in
research data management. It supports open science by
guiding users to trustworthy and reliable repositories,
thereby facilitating the discovery of high-quality data
across different scientific fields.</p>
        <p>2.4. Data lake, science lake and dataverse</p>
        <p>Data lakes are expansive storage repositories that hold
vast amounts of raw data in its native format until needed. Stein
and Morrison [20] emphasised their potential for
scalability and flexibility in handling big data from various
sources. Dixon, in 2010<sup>10</sup>, defined the
term and its initial application in big data analytics. Quix
et al. (2016) [21] delved into the architectural
considerations and challenges such as data governance and
metadata management.</p>
        <p>Science lakes, an offshoot of data lakes, are tailored
specifically for the scientific community to address the
need for interdisciplinary research, data management
and complex analytics. Russom (2016) [22] suggested
that science lakes provide a more discipline-specific
approach to data handling, enabling better metadata
curation and domain-specific data models, which are crucial
for reproducibility in scientific research.</p>
        <p>A data lake is a vast storage system that houses
extensive volumes of raw data in its original format. This
versatile system accommodates a range of data types,
including structured, semi-structured, and unstructured forms.
Data lakes are essential in environments focused on big
data analytics and are designed to manage data
characterised by large volume, high velocity, and diverse variety
from multiple sources. They are commonly utilised for
advanced data processing activities such as machine
learning and predictive analytics. Unlike traditional databases
following the schema-on-write approach, data lakes
follow the schema-on-read approach, providing flexibility
in how data is formatted and used.</p>
        <p>Dataverse. The concept of dataverse takes the
notion of data lakes further by creating a networked space
where data is stored, actively managed, and shared within
the scientific community. A dataverse is a data
repository platform that enables researchers to publish, cite, and
discover datasets while providing metadata and tools to
ensure others can understand and use the data. Dataverses
are often domain-specific and support the principles of
open science, providing features such as data version
control, digital object identifiers (DOIs) for citation, and
tools for data analysis within the platform. They are
community-driven and emphasize the accessibility and
reusability of research data.</p>
        <p>
          The most prominent example is the open-source
Dataverse project developed by the Institute for
Quantitative Social Science at Harvard University. The Dataverse
Project, initiated by King [
          <xref ref-type="bibr" rid="ref1">23</xref>
          ], provides an open-source
platform for sharing, preserving, citing, exploring, and
analysing research data. It focuses on data citation and
reproducibility, as discussed by Crosas [
          <xref ref-type="bibr" rid="ref2">24</xref>
          ], who highlighted
the platform’s role in fostering collaboration and
open science.
        </p>
        <p>5 https://www.qdasoftware.org
6 kaggle.com
7 https://datadryad.org/stash
8 https://www.dataone.org/about/
9 https://www.re3data.org</p>
        <p>Different academic institutions have built their
dataverses for sharing and disseminating experimental
scientific results, including the data collections they curate:
the University of Arizona<sup>11</sup>, the University of Hamburg<sup>12</sup>,
the University of Michigan<sup>13</sup> and the Grenoble Dataverse<sup>14</sup>.</p>
        <p>Summary. Together, these systems represent a shift
toward more open, integrated, and efficient ecosystems
for data management, offering novel solutions to the
challenges posed by the vast amounts of data generated
in modern research. They move away from traditional
databases and toward more fluid, dynamic systems that
can accommodate the ever-changing landscape of big
data and scientific research.</p>
        <p>A dataverse and a data lake are concepts related to data
storage and management but serve different purposes and
are designed with different use cases in mind. While
a dataverse is a scholarly platform aimed at curating,
sharing, and preserving research data with rich metadata
and community collaboration features, a data lake is a
more generalised and scalable storage solution for raw
data to support diverse data analytics and processing
workflows.</p>
        <p>2.5. Data lakes and dataverses in Life &amp; Earth sciences</p>
        <p>Dataverses in Life &amp; Earth sciences are specialised digital
infrastructures designed to address specific data
management needs for these scientific domains. They provide a
structured yet flexible environment where datasets can
be stored, accessed, shared, and analysed. These
dataverses typically offer robust metadata standards and tools
to ensure their data are well-described, making them
discoverable and usable for various research purposes.</p>
        <p>In Life Sciences, dataverses often focus on genomics,
proteomics, clinical trials, and other biological data,
integrating various sources of information to aid in
complex analyses like phenotype-genotype correlations. For
Earth Sciences, dataverses might concentrate on
geospatial data, climate models, seismic activity records, and
ecological data, supporting efforts to understand and
model the Earth’s dynamic systems.</p>
        <p>These repositories support open science by promoting
data sharing across disciplinary boundaries. This
feature enables researchers to replicate studies and build
upon existing work, which is fundamental for advancing
knowledge. They also facilitate interdisciplinary
collaboration, allowing experts from different fields to contribute
to and draw from a collective data pool. For instance, a
dataverse in these fields might include a combination of
high-throughput experimental data, field observations,
and simulation outputs. The combination of openness
and rigorous data management positions dataverses as
critical resources in pursuing scientific discovery in Life
&amp; Earth sciences.</p>
        <p>In life and earth sciences, data lakes are pivotal for
consolidating scientific data collected from various
biodiversity studies and geological events like earthquakes. Once
curated, processed, and analysed, this data contributes
significantly to data-driven experiments underpinned by
well-established protocols. The harvested data enriches
the data lake and supports the creation of detailed,
curated views for dissemination through dataverses.</p>
        <p>Our vision emphasises the importance of developing
and maintaining data lakes with partially curated
content in life and earth sciences, facilitating the continuous
cycle of experimental data feeding back into the lake and
subsequently being shared via dataverses.</p>
        <p>11 https://arizona.figshare.com
12 https://www.fdm.uni-hamburg.de/en/fdm.html
13 https://www.icpsr.umich.edu/web/about/cms/2365
14 https://scienceouverte.couperin.org/cellule-data-grenoble-alpes/</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Maintaining and sharing earth and life sciences knowledge: challenges</title>
      <p>
        Various data on life and earth sciences have been
acquired from different sources [
        <xref ref-type="bibr" rid="ref3">25</xref>
        ]. Integrated access to
data collections and their curated versions can facilitate
their maintenance, analysis and experimentation. It can
also demonstrate knowledge of the discipline with its
vocabulary, concepts and relationships in a synthetic way.
Curation, maintenance and exploration of data
collections in the data lake call for techniques to
explore and enrich data collections as new data and
analytical results are produced.
Data curation also means keeping track of the type of
experiments carried out on the data, their results and the
conditions under which they were carried out.
Maintaining a catalogue of data-related questions and experiments
can promote open science and share both the data and
the knowledge the scientific community
has gained from it [
        <xref ref-type="bibr" rid="ref4">26</xref>
        ]. This information should also be
stored in the data lake.
      </p>
      <p>
        Challenge 1: How to structure and organise life and
earth sciences metadata? Metadata modelling is a
way of structuring and organising earthquake and
biodiversity data. The metadata model must make the content of a
data lake findable, accessible, interoperable and reusable
(FAIR principles [
        <xref ref-type="bibr" rid="ref5">27</xref>
        ]). Metadata can represent the data’s
structural, semantic and contextual aspects (provenance,
conditions and assumptions under which the analytical
results are obtained, i.e., the metadata driving the
analysis). Most proposed models are based on logic or
structured by graphs [
        <xref ref-type="bibr" rid="ref6 ref7">28, 29</xref>
        ] that can be specialised for seismic
geophysical data and biodiversity. In addition, metadata
can be associated with data by considering quantitative
and qualitative perspectives through data curation.
Combining quantitative and qualitative approaches allows
for a meta-model of the content used and produced in
experiments and the conditions in which the content is
produced, chosen, validated and considered
representative knowledge for the domain of study.
      </p>
      <p>
        Figure 1 illustrates the principle of our vision concerning
the way a life and earth sciences data lake can be built,
maintained and exploited. Our approach is based on the
quantitative and qualitative curation of data harvested
digitally and in situ (left-hand side of the figure).
Heterogeneous raw data is gathered and stored in the data lake.
      </p>
      <p>Then, algorithms (statistical and Artificial Intelligence)
and researchers can process, filter and classify data. This
filtering process produces metadata that are stored in the
data lake. Data exploration and integration (cleaning and
engineering) processes can be performed on data samples
from the data lake. These samples can be used for experimental
purposes to produce content associated with the data
stored in the data lake. Clean and curated data
associated with metadata representing the quantitative and
qualitative perspective of the experiments can then be
shared in a dataverse (right-hand side of the figure).</p>
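      <p>The flow described above (raw ingestion, filtering that produces metadata, and export of a curated view) can be illustrated with a minimal in-memory sketch. It is not the architecture of the data lake itself: the class DataLake, its methods, the amplitude threshold and the sample signals are all hypothetical names and values introduced only for illustration.</p>
      <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class DataLake:
    """Toy in-memory data lake: raw items plus a metadata catalogue (illustrative only)."""
    raw: dict = field(default_factory=dict)       # item id -> payload, stored as harvested
    metadata: dict = field(default_factory=dict)  # item id -> list of annotations

    def ingest(self, item_id, payload):
        # Raw data is kept in its original form, without transformation.
        self.raw[item_id] = payload

    def annotate(self, item_id, label, source):
        # Filtering and classification output is stored as metadata next to the raw item.
        self.metadata.setdefault(item_id, []).append({"label": label, "source": source})

    def export_view(self, label):
        # Curated view: items whose most recent annotation carries the requested
        # label, each exported together with its full annotation history.
        return {
            item_id: {"data": self.raw[item_id], "metadata": notes}
            for item_id, notes in self.metadata.items()
            if notes[-1]["label"] == label
        }


def amplitude_label(signal):
    # Hypothetical statistical filter: a peak above a made-up threshold is an "event".
    return "event" if max(signal) > 5.0 else "noise"


lake = DataLake()
lake.ingest("s1", [0.1, 7.2, 0.3])
lake.ingest("s2", [0.2, 6.0, 0.1])
lake.ingest("s3", [0.1, 0.2, 0.1])

for item_id, signal in lake.raw.items():
    lake.annotate(item_id, amplitude_label(signal), source="algorithm")

# A researcher reviews s2 and overrides the algorithmic label.
lake.annotate("s2", "noise", source="researcher")

view = lake.export_view("event")
print(sorted(view))
```
      </preformat>
      <p>In this sketch the raw payloads are never modified; every algorithmic or manual decision is appended to the metadata, so the exported view carries the history that led to it.</p>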
      <p>
        Challenge 2: How to integrate data in the data lake?
Since the experiments require several data collections,
integrating the data into the data lake must be part of
a pipeline that includes data discovery, exploration,
selection and integration. This process should be designed
based on the requirements of life and earth science
experiments [
        <xref ref-type="bibr" rid="ref3">25</xref>
        ]. The heterogeneity of the data (text, signals,
multimedia, proprietary formats from seismographs), the
speed of the data, often produced in the form of streams in
the case of seismic sensors, and the volume are
aspects that require original contributions in the design,
maintenance and exploration of the data lake.
      </p>
      <p>Harvested data, models and knowledge
integration. Various life and earth sciences data have been
harvested from different sources. Since they are
heterogeneous and produced at different paces (continuous and
in batch), we propose an integration approach
based on a pivot meta-representation. The principle is
to present a general meta-model of their content and
process them for extracting technical, structural and
semantic metadata. This abstract representation provides
integrated access to data collections and curated versions
under a global knowledge graph and can promote their
maintenance, analysis, and experimentation. It can also
show the knowledge of the discipline with its
vocabulary, concepts, and relations in a synthetic manner. The
data lake can be pivotal in collecting, processing, and
exporting raw data in a curated view.
      </p>
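      <p>As an illustration of the pivot meta-representation idea, the following sketch maps heterogeneous items (a JSON record, free text, a binary waveform) onto one common record with technical, structural and semantic parts, and then flattens the records into triples as a simple stand-in for a knowledge graph. The record layout, the predicates hasKind and about, and all item names are assumptions made for this example, not the meta-model of the paper.</p>
      <preformat>
```python
import json


def pivot_record(name, payload, keywords):
    """Describe one harvested item with a common (pivot) metadata record."""
    if isinstance(payload, (bytes, bytearray)):
        structural = {"kind": "binary"}
        size = len(payload)
    elif isinstance(payload, str):
        try:
            parsed = json.loads(payload)
        except ValueError:  # not JSON: treat the payload as free text
            parsed = None
        if isinstance(parsed, dict):
            structural = {"kind": "json", "fields": sorted(parsed)}
        else:
            structural = {"kind": "text"}
        size = len(payload.encode("utf-8"))
    else:
        raise TypeError("unsupported payload type")
    return {
        "name": name,
        "technical": {"size_bytes": size},
        "structural": structural,
        "semantic": {"keywords": sorted(keywords)},
    }


def to_triples(records):
    # Flatten pivot records into (subject, predicate, object) triples,
    # a simple stand-in for a knowledge graph.
    triples = []
    for rec in records:
        triples.append((rec["name"], "hasKind", rec["structural"]["kind"]))
        for kw in rec["semantic"]["keywords"]:
            triples.append((rec["name"], "about", kw))
    return triples


# Hypothetical harvested items of three different natures.
records = [
    pivot_record("obs-001", '{"station": "ST01", "magnitude": 2.1}', {"seismology"}),
    pivot_record("note-007", "Physalia physalis sighted on the beach", {"biodiversity"}),
    pivot_record("wave-042", b"\x00\x01\x02\x03", {"seismology", "waveform"}),
]
graph = to_triples(records)
for triple in graph:
    print(triple)
```
      </preformat>
      <p>Because every item is described by the same record shape, collections of very different natures can be queried together, for example by listing all items linked to the keyword seismology.</p>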
      <p>
        Challenge 3: How to integrate data in the data lake
considering scientists’ needs? The researcher’s
intervention, defined as a researcher-in-the-loop (RITL)
[
        <xref ref-type="bibr" rid="ref8">30</xref>
        ], is a crucial form of human intervention used (i) to assess
content with respect to the conditions in which it is
produced and (ii) to make decisions about the new tasks
to perform and the way a research project will move
forward. RITL is a case of Human-in-the-loop (HITL),
where the primary output of the process is a selection
of the data, not a trained machine-learning model. HITL
is crucial for handling supervision, exception control,
optimisation, and maintenance [
        <xref ref-type="bibr" rid="ref10 ref9">31, 32</xref>
        ]. Under a RITL
approach, a human sees all data points in the relevant
selection at the end of the process. Using RITL requires
a solid, systematic way of working<sup>15</sup>. This characteristic
is critical for designing content curation for quantitative
and qualitative research methods.
      </p>
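      <p>A minimal sketch of the researcher-in-the-loop principle is given below, under the assumption that an algorithm pre-selects candidates by score and that the researcher then reviews every candidate. The function ritl_select, the confidence and blurry fields, and the threshold are hypothetical; the point is only that the output is a selection of data plus a trace of the decisions, not a trained model.</p>
      <preformat>
```python
def ritl_select(items, score, threshold, researcher_decides):
    """Researcher-in-the-loop selection (illustrative sketch).

    An algorithm pre-selects candidates; the researcher then reviews every
    candidate, and the output is the resulting selection plus an audit trail.
    """
    candidates = [item for item in items if score(item) >= threshold]
    selection, audit = [], []
    for item in candidates:  # every pre-selected data point is shown to the researcher
        keep = researcher_decides(item)
        audit.append({"item": item["id"], "score": score(item), "kept": keep})
        if keep:
            selection.append(item)
    return selection, audit


# Hypothetical in situ observations with an automatic confidence score.
observations = [
    {"id": "a", "confidence": 0.95, "blurry": False},
    {"id": "b", "confidence": 0.90, "blurry": True},
    {"id": "c", "confidence": 0.40, "blurry": False},
]

selection, audit = ritl_select(
    observations,
    score=lambda obs: obs["confidence"],
    threshold=0.8,
    # Stand-in for a manual decision: the researcher rejects blurry pictures.
    researcher_decides=lambda obs: not obs["blurry"],
)
print([obs["id"] for obs in selection])
print(audit)
```
      </preformat>
      <p>The audit list records the conditions under which each item was kept or rejected, which is the kind of qualitative metadata that would be stored next to the data.</p>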
      <p>
        Scientific content should be extracted and computed,
including data, analytics tasks (manual and AI models),
and associated metadata. This curated content allows the
produced knowledge to be reusable and analytics results
to be reproducible [
        <xref ref-type="bibr" rid="ref11">33</xref>
        ], thereby adhering to the FAIR
principles [
        <xref ref-type="bibr" rid="ref12">34</xref>
        ].
      </p>
      <p>Curation, maintenance, and exploration of data
collections for bringing data value from in situ
observations and experiments. Since data acts as a
backbone in modelling phenomena for understanding
their behaviour, it is critical to develop good data
collection and maintenance practices: which data
collections are available? Are they complete? What is their provenance?
Under which conditions were they collected? Have they been
processed? In which cases have they been used, and what
are the associated results? We propose techniques to
explore data collections using graphs that can be explored
and enriched while new data and analytics results are
produced. Data curation also means keeping track of the
type of experiments run on data, their results, and the
conditions in which they were performed. Maintaining a
catalogue of data-related questions and experiments can
promote open science and share data and the knowledge
that the scientific community has derived from it.</p>
      <p>15 https://hai.stanford.edu/news/humans-loop-design-interactive-ai-systems</p>
      <p>Modelling and simulating experiments to answer
questions in life and earth sciences. Answering
research questions through data-driven experiments
implies:
• Designing ad hoc experiment artefact models and
programming languages for enabling friendly,
context-aware, and declarative construction of
experiments in life and earth sciences.
• Collecting experiment execution data (raw
input data, prepared datasets, the calibration of the
experiments’ tasks and associated results).</p>
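      <p>One possible way to keep the execution data listed above together is a small experiment record that bundles the research question, the input datasets, the calibration parameters and the results. The sketch below assumes a hypothetical ExperimentRun structure and a made-up analysis task (mean_above); it does not represent the experiment artefact models or languages envisioned in the paper.</p>
      <preformat>
```python
from dataclasses import dataclass, field, asdict
from statistics import mean


@dataclass
class ExperimentRun:
    """Record of one experiment execution (illustrative structure)."""
    question: str
    input_ids: list
    parameters: dict
    results: dict = field(default_factory=dict)


def run_experiment(question, datasets, parameters, task):
    # Execute the task and keep inputs, calibration and results together.
    run = ExperimentRun(question, sorted(datasets), dict(parameters))
    run.results = task(datasets, **parameters)
    return run


def mean_above(datasets, cutoff):
    # Hypothetical analysis task: per-dataset mean and whether it exceeds a cutoff.
    return {
        name: {"mean": mean(values), "above": mean(values) > cutoff}
        for name, values in datasets.items()
    }


run = run_experiment(
    "Which stations show elevated activity?",
    {"st01": [1.0, 2.0, 3.0], "st02": [4.0, 5.0, 6.0]},
    {"cutoff": 3.0},
    mean_above,
)
record = asdict(run)  # plain dictionary, ready to be stored in a catalogue
print(record)
```
      </preformat>
      <p>Storing such records alongside the data would let a later reader see which question was asked, with which calibration, and what was obtained.</p>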
      <p>Pilot experiments. The data lake will be tested in real
scenarios through collaboration with domain experts in
seismology and biodiversity studies in Brazil. The entry
point will be two pilot experiments, namely:
1. the classification process of seismic signals
collected by stations through diferent observations
to detect "natural" and human-made earthquakes
in the northern human-made earthquakes in the
northern region of Brazil;
2. the classification of in situ observations of the
"carabela portuguesa"16 and modelling its
behaviour on the Brazilian coast.
16 The Portuguese man o' war (Physalia physalis) is a monotypic colonial
species of siphonophore hydrozoan of the family Physaliidae. It
is commonly found in the open ocean in all warm waters of the
world, especially in the tropical and subtropical regions of the
Pacific and Indian Oceans, as well as in the Atlantic Gulf Stream.
Its sting is dangerous and very painful. See https://es.wikipedia.org/wiki/Physalia_physalis.</p>
      <p>In both cases, it is necessary to (i) apply statistical
methods to investigate and unveil new patterns in seismic
and biodiversity data, answering open problems or
leading to new research questions; and (ii) build predictive models
that better describe or approximate phenomena, thereby
increasing knowledge about our planet. The conditions in
which the statistics and predictions are produced, together with the results,
observations, interpretations, and validation of those results,
are themselves data to be integrated into the data lake.</p>
      <p>Discussion. The originality of the work lies in addressing
the construction of a data lake that includes:
1. Raw collected data representing life and earth
sciences phenomena (streams, batch, multimedia,
proprietary).
2. Data produced during data-driven experiments
that adopt data science techniques, including
artificial intelligence algorithms (ML-driven data
lakes).
3. Contextual data describing the conditions in
which data are collected and experiments are
designed and enacted. The data lake will provide
data curation modules for extracting metadata
according to a well-adapted model, and modules for
exploring data and using them to design new
experiments, thereby adopting an open
science perspective.</p>
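      <p>The three kinds of content listed in the discussion (raw data, data produced by experiments, and contextual data) can be pictured as entries in a small metadata catalogue. The following Python sketch is only an illustration under stated assumptions: the Kind enumeration, the CatalogueEntry fields, and the find function are hypothetical names rather than components of the data lake described here, and the sample entries are invented.</p>
      <preformat>
```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    """The three kinds of content named in the discussion."""
    RAW = "raw"                  # raw collected data
    EXPERIMENT = "experiment"    # data produced during experiments
    CONTEXT = "context"          # conditions of collection and enactment


@dataclass
class CatalogueEntry:
    """Metadata about one item stored in the lake (hypothetical fields)."""
    identifier: str
    kind: Kind
    domain: str                                    # e.g. "seismology"
    describes: list = field(default_factory=list)  # identifiers this entry documents


def find(entries, kind=None, domain=None):
    """Return the entries matching the given kind and/or domain."""
    return [
        e for e in entries
        if (kind is None or e.kind is kind)
        and (domain is None or e.domain == domain)
    ]


# Invented sample entries, for illustration only.
catalogue = [
    CatalogueEntry("sig-001", Kind.RAW, "seismology"),
    CatalogueEntry("pred-001", Kind.EXPERIMENT, "seismology", ["sig-001"]),
    CatalogueEntry("ctx-001", Kind.CONTEXT, "seismology", ["pred-001"]),
    CatalogueEntry("obs-001", Kind.RAW, "biodiversity"),
]

raw_seismic = find(catalogue, kind=Kind.RAW, domain="seismology")
print([e.identifier for e in raw_seismic])
```
</preformat>
      <p>In this sketch the describes field links contextual entries to the experiment data they document, and experiment data to the raw data they were derived from, so that exploring the catalogue can follow the chain from a result back to its inputs and conditions.</p>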
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions and future work</title>
      <p>Our vision is that it is necessary to address fundamental research topics at the centre of Data Science, Big Data management and analytics for solving data-driven problems in life and earth sciences.
The contribution is the design of a data lake, together with techniques
for exploring it, built on a well-adapted metadata model for
life and earth sciences experiments that consume
and produce quantitative and qualitative data. An
important piece of future work will be to define exploration operators and
pipelines that exploit this content to maintain existing
life and earth sciences experiments and develop new ones.</p>
    </sec>
    <sec id="sec-4">
      <title>6. Acknowledgements</title>
      <p>The work reported in this paper was done in the context of
the LETITIA<sup>17</sup> project, funded by the Fédération
Informatique de Lyon<sup>18</sup>.</p>
    </sec>
  </body>
  <back>
  </back>
</article>