Dataversifying Natural Sciences:
Pioneering a Data Lake Architecture for Curated Data-Centric Experiments in Life & Earth Sciences

Genoveva Vargas-Solar1, Jérôme Darmont2, Alejandro Adorjan4, Javier A. Espinosa-Oviedo1,3, Carmem Hara5, Sabine Loudcher2, Regina Motz6, Martin Musicante7 and José-Luis Zechinelli-Martini8

1 CNRS, Univ. Lyon, INSA Lyon, UCBL, LIRIS, UMR5205, F-69221, France
2 Université de Lyon, Lyon 2, UR ERIC, 5 avenue Mendès France, 69676 Bron Cedex, France
3 CPE Lyon, 43 Blvd. du 11 Novembre 1918, 69616 Villeurbanne Cedex, France
4 Universidad ORT, Montevideo, Uruguay
5 Universidade Federal do Paraná, Dept. de Informática, Curitiba - PR, 81531-980, Brazil
6 Instituto de Computación (INCO), Facultad de Ingeniería, Universidad de la República, Uruguay
7 Universidade Federal do Rio Grande do Norte, DIMAP, Natal, Brazil
8 Fundación Universidad de las Américas, Puebla, Exhacienda Sta. Catarina Mártir s/n, 72820 San Andrés Cholula, Mexico


Abstract
This vision paper introduces a pioneering data lake architecture designed to meet Life & Earth sciences' burgeoning data management needs. As the data landscape evolves, the imperative to navigate and maximise scientific opportunities has never been greater. The paper outlines a strategic approach to unify and integrate diverse datasets, aiming to cultivate a collaborative space conducive to scientific discovery. The core of the design and construction of a data lake is the development of formal and semi-automatic tools, enabling the meticulous curation of quantitative and qualitative data from experiments. Our unique "research-in-the-loop" methodology ensures that scientists across various disciplines are integrally involved in the curation process, combining automated, mathematical, and manual tasks to address complex problems, from seismic detection to biodiversity studies. By fostering reproducibility and applicability of research, our approach enhances the integrity and impact of scientific experiments. This initiative is set to improve data management practices, strengthening the capacity of Life & Earth sciences to solve some of our time's most critical environmental and biological challenges.

Keywords
Life and Earth sciences, data-driven experiments, data lake, data curation



Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy.
* Genoveva Vargas-Solar.
† The authors' list is alphabetical except for the first two authors.
$ genoveva.vargas-solar@cnrs.fr (G. Vargas-Solar); jerome.darmont@univ-lyon2.fr (J. Darmont); aadorian@gmail.com (A. Adorjan); javier.espinosa@liris.cnrs.fr (J. A. Espinosa-Oviedo); carmemhara@ufpr.br (C. Hara); sabine.loudcher@univ-lyon2.fr (S. Loudcher); rmotz@fing.edu.uy (R. Motz); mam@dimap.ufrn.br (M. Musicante); joseluis.zechinelli@udlap.mx (J. Zechinelli-Martini)
© 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

1. Introduction

These days, it is relatively easy and inexpensive to acquire massive amounts of data, even in continuous mode. This has been no different for experimental and observational sciences like Life & Earth sciences. Accessibility to data about the Earth and its biodiversity, with varying levels of provenance, quality and reliability, opens up the possibility of constructing different perspectives on the phenomena observed, leading to scientific conclusions with different depths that target a wide range of knowledge consumers (civilians, decision-makers, scientists).

Traditional schema-on-write approaches, such as the Extraction, Transformation and Loading (ETL) process, are ineffective for the data management requirements of these experimental sciences. Data lakes are becoming increasingly common for the management and analysis of massive data. Data lakes are repositories that store raw data in its original format. They can be well adapted for storing data harvested from digital sources (observation stations), social media, the Web and in situ collectors.

The extraction of value through data-driven experiments in the Life & Earth sciences is determined by two main elements:

• The maintenance of metadata gathering the conditions under which experiments are performed (quantitative perspective), to preserve the memory of the experimental knowledge production process and to enable understanding and reproducibility.
• An open science perspective that can go beyond data sharing and must consider the sharing of




know-how, decision-making, expertise, project management, and people within the projects that define the research.

This vision paper introduces our approach to designing and building a data lake for collecting and integrating data and metadata of Life & Earth sciences' data-driven experiments.

The remainder of the paper is organised as follows. Section 2 gives a general overview of approaches that address curating and managing knowledge in Life & Earth sciences. Section 3 describes the challenges associated with curating data and data-driven experiments in Life & Earth sciences, often guided by researchers. In particular, the section gives the general challenges for building data lakes containing curated data and producing knowledge derived from data-driven experiments. Section 4 introduces the general principle for building, maintaining and exploiting a data lake. The data lake allows the creation of "dataverses" that can export the history of the development of experimental processes that lead to knowledge in Life & Earth sciences. Finally, Section 5 concludes the paper and discusses future work.

2. Related work

We introduce the main topics and approaches that underline the vision of maintaining and sharing data to perform data-driven experiments: data harvesting tools, data curation techniques, data labs, data lakes, science lakes and dataverses.

2.1. Data harvesting

Data available on the Web play a determining role in decision-making in personal and corporate life. Collecting and storing this data in a structured model helps integrate them with other sources and use the dataset in various applications, such as event detection and sentiment monitoring. Online newspapers are essential sources of information, accessed daily by thousands of people.

Various works in the literature report manual efforts to extract data from pages on the Web [1, 2]. However, these efforts have been eased by applying Web scraping techniques. Some work complements automated extraction processes to obtain clean and analysed data by implementing curation procedures [3]. Among the various existing tools available on the Web for data extraction, we can highlight ParseHub1, a web scraping tool that facilitates data extraction from websites through an interactive click-based interface, saving the data directly to the cloud in JSON and CSV formats. It navigates through continuation pages and captures complete news articles, with the ability to collect data based on specific character sequences. 80legs2 offers sequential data extraction from websites. Octoparse3 simplifies the data extraction process by enabling users to create a scraping workflow with clicks. It includes features like URL and string lists for targeted scraping and ready-to-use templates for popular sites like Amazon and Google. FactExtract [3] is tailored for aggregating content from specific Senegalese news sources, boasting automatic language detection for ten languages, data cleaning, and analysis, all whilst avoiding data duplication. This tool, which utilises Python's Newspaper library, also features automated daily updates for the news content it monitors. ENoW - News Data Extractor from the Web4 is a news scraping system that explores online newspapers. ENoW receives search strings as input and stores in a relational database the data extracted from the news and their full content.

1 https://www.parsehub.com/
2 https://80legs.com/
3 https://www.octoparse.com/
4 L. Reips, M. Musicante, G. Vargas-Solar, A.T.R. Pozo, C.S. Hara, ENoW - Extrator de Dados de Notícias da Web, Demonstration, Anais Estendidos do XXXVIII Simpósio Brasileiro de Bancos de Dados, 2023, 78-83.

2.2. Data curation

According to Garcov et al. [4], research data curation is "preparing research data and artefacts for sharing and long-term preservation". Research repositories are the standard for publishing data collections to the research communities. Datasets at an early collection stage are generally not ready for analysis or preservation. Thus, extensive preprocessing, cleaning, transformation, and documentation actions are required to support usability, sharing, and preservation over time [5]. Curated data collections have the potential to drive scientific progress [6], are relevant for reproducibility and improve the reliability of sciences [7]. However, data curation introduces challenges for supporting data-driven applications [8] adopting quanti-qualitative methods. For example, research challenges include curating material across time, space and collaborators [7]. Quantitative and qualitative research methodologies apply ad-hoc data curation strategies that keep track of the data that describe the tools, techniques, hypotheses, and data harvesting criteria defined a priori by a scientific team.

Several software tools that apply statistical techniques and machine learning algorithms are available for qualitative researchers. Woods et al. [9] argue that Computer-Assisted Qualitative Data Analysis Software (CAQDAS) is a well-known tool for qualitative research. These tools support qualitative techniques and methods for applying Qualitative Data Analysis (QDA). ATLAS.ti [10], Dedoose [11], MAXQDA [12] and NVivo [13] implement the REFI-QDA standard, an interoperability exchange format.
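As a concrete illustration, a REFI-QDA project archive (.qdpx) is a ZIP whose project.qde entry is an XML document; a minimal Python sketch of listing the codes declared in such an archive might look as follows. The element names (CodeBook/Codes/Code) follow the REFI-QDA Project specification, but the sample project content and the helper function are invented for illustration, and real files also carry an XML namespace, omitted here for brevity.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Build a tiny synthetic .qdpx archive in memory. The XML structure
# mirrors the REFI-QDA Project layout; the codes are invented.
qde = """<?xml version="1.0" encoding="utf-8"?>
<Project name="FieldNotes">
  <CodeBook>
    <Codes>
      <Code name="seismic-event" isCodable="true"/>
      <Code name="species-sighting" isCodable="true"/>
    </Codes>
  </CodeBook>
</Project>"""

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("project.qde", qde)

def list_codes(qdpx_bytes: bytes) -> list:
    """Extract the code names declared in a REFI-QDA-style archive."""
    with zipfile.ZipFile(io.BytesIO(qdpx_bytes)) as z:
        root = ET.fromstring(z.read("project.qde"))
    return [c.attrib["name"] for c in root.iter("Code")]

print(list_codes(buf.getvalue()))  # ['seismic-event', 'species-sighting']
```

Because the exchange format is a plain ZIP-plus-XML container, any of the QDA tools listed above can, in principle, read a project exported by another.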
With CAQDAS [14], researchers and practitioners can perform annotation, labelling, querying, audio and video transcription, pattern discovery, and report generation. Furthermore, CAQDAS tools allow the creation of field notes, thematic coding, search for connections, memos (thoughtful comments), contextual analysis, frequency analysis, word location and data analysis presentation in different reporting formats [15]. The REFI-QDA (Rotterdam Exchange Format Initiative)5 standard allows the exchange of qualitative data to enable reuse in QDAS [16]. QDA software such as ATLAS.ti [10], Dedoose [11], MAXQDA [12], NVivo [13], QDAMiner [17], Quirkos [18] and Transana [19] adopt the REFI-QDA standard.

We assume that data curation consists of identifying, systematising, managing, and versioning research data, considering the versioning of artefacts an essential component of tracking changes along the research project.

2.3. Data labs

Data science environments like Kaggle6 and Dryad7 provide data labs with stacks of services for (externalised) data storage, tagging and exploration tools. These environments allow a collective sharing space of highly curated data collection maintenance tools. There are also specialised repositories like DataONE8 and registries of data repositories such as re3data9.

DataONE (Data Observation Network for Earth) is a community-driven project that provides access to various environmental and ecological data across multiple member repositories. It is designed as an innovative framework aimed at facilitating research and enabling scientists and researchers to preserve, access, use, and increase the impact of their data. The platform provides robust data management tools, ensuring datasets' preservation and integrity. DataONE underscores data stewardship as a federated resource and supports scientific collaboration and reproducibility. It is invaluable for researchers seeking to address complex environmental challenges through shared data and knowledge.

re3data is a global registry of research data repositories that offers a comprehensive directory for researchers seeking to access, store, share, and manage their datasets. It represents a variety of academic disciplines and provides detailed information about each repository, such as access policies, standards, and contact details. re3data promotes data sharing, visibility, and reuse as a critical reference point for finding suitable repositories for data deposition. The platform enhances transparency in research data management. It supports open science by guiding users to trustworthy and reliable repositories, thereby facilitating the discovery of high-quality data across different scientific fields.

5 https://www.qdasoftware.org
6 https://www.kaggle.com
7 https://datadryad.org/stash
8 https://www.dataone.org/about/
9 https://www.re3data.org

2.4. Data lake, science lake and dataverse

Data lakes are expansive storage repositories that hold vast raw data in their native format until needed. Stein and Morrison [20] emphasised their potential for scalability and flexibility in handling big data from various sources. Dixon defined the term and its initial application in big data analytics in 201010. Quix et al. (2016) [21] delved into architectural considerations and challenges such as data governance and metadata management.

10 https://jamesdixon.wordpress.com/2014/09/25/data-lakes-revisited/

Science lakes, an offshoot of data lakes, are tailored specifically for the scientific community to address the need for interdisciplinary research, data management and complex analytics. Russom (2016) [22] suggested that science lakes provide a more discipline-specific approach to data handling, enabling better metadata curation and domain-specific data models, which are crucial for reproducibility in scientific research.

A data lake is a vast storage system that houses extensive volumes of raw data in its original format. This versatile system accommodates a range of data types, including structured, semi-structured, and unstructured forms. Data lakes are essential in environments focused on big data analytics and are designed to manage data characterised by large volume, high velocity, and diverse variety from multiple sources. They are commonly utilised for advanced data processing activities such as machine learning and predictive analytics. Unlike traditional databases following the schema-on-write approach, data lakes follow the schema-on-read approach, providing flexibility in how data is formatted and used.

Dataverse. The concept of dataverse takes the notion of data lakes further by creating a networked space where data is stored, actively managed, and shared within the scientific community. A dataverse is a data repository platform that enables researchers to publish, cite, and discover datasets while providing metadata and tools to ensure others can understand and use the data. Dataverses are often domain-specific and support the principles of open science, providing features such as data version control, digital object identifiers (DOIs) for citation, and tools for data analysis within the platform. They are community-driven and emphasise the accessibility and reusability of research data.
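Schema-on-read, the property that distinguishes data lakes from the schema-on-write systems above, can be sketched in a few lines: raw, heterogeneous records are stored untouched, and each analysis imposes its own schema only at read time. The record fields and the reader below are illustrative assumptions, not part of any cited system.

```python
import json

# A "lake" of raw records kept in their original, heterogeneous form:
# one seismic reading and one biodiversity observation (invented fields).
lake = [
    '{"source": "seismograph-7", "magnitude": 4.2, "ts": "2024-01-03"}',
    '{"source": "field-app", "species": "Ardea alba", "count": 3}',
]

def read_with_schema(raw_records, schema):
    """Schema-on-read: project each raw record onto the fields an
    analysis needs, skipping records that do not fit the schema."""
    out = []
    for raw in raw_records:
        rec = json.loads(raw)
        if all(field in rec for field in schema):
            out.append({field: rec[field] for field in schema})
    return out

# Two analyses impose two different schemas on the same raw data.
print(read_with_schema(lake, ["source", "magnitude"]))
print(read_with_schema(lake, ["species", "count"]))
```

No upfront schema was required at ingestion time; the same raw records serve a seismic analysis and a biodiversity analysis, each with its own view.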
The most prominent example is the open-source Dataverse project developed by the Institute for Quantitative Social Science at Harvard University. The Dataverse Project, initiated by King [23], provides an open-source platform for sharing, preserving, citing, exploring, and analysing research data. It focuses on data citation and reproducibility, as discussed by Crosas [24], who highlighted the platform's role in fostering collaboration and open science.

Different academic institutions have built their own dataverses for sharing and disseminating experimental scientific results, including the data collections they curate: the University of Arizona11, the University of Hamburg12, the University of Michigan13 and the Grenoble Dataverse14.

11 https://arizona.figshare.com
12 https://www.fdm.uni-hamburg.de/en/fdm.html
13 https://www.icpsr.umich.edu/web/about/cms/2365
14 https://scienceouverte.couperin.org/cellule-data-grenoble-alpes/

Summary. Together, these systems represent a shift toward more open, integrated, and efficient ecosystems for data management, offering novel solutions to the challenges posed by the vast amounts of data generated in modern research. They move away from traditional databases and toward more fluid, dynamic systems that can accommodate the ever-changing landscape of big data and scientific research.

A dataverse and a data lake are concepts related to data storage and management but serve different purposes and are designed with different use cases in mind. While a dataverse is a scholarly platform aimed at curating, sharing, and preserving research data with rich metadata and community collaboration features, a data lake is a more generalised and scalable storage solution for raw data to support diverse data analytics and processing workflows.

2.5. Data lakes and dataverses in Life & Earth sciences

Dataverses in Life & Earth sciences are specialised digital infrastructures designed to address specific data management needs for these scientific domains. They provide a structured yet flexible environment where datasets can be stored, accessed, shared, and analysed. These dataverses typically offer robust metadata standards and tools to ensure their data are well-described, making them discoverable and usable for various research purposes.

In Life Sciences, dataverses often focus on genomics, proteomics, clinical trials, and other biological data, integrating various sources of information to aid in complex analyses like phenotype-genotype correlations. For Earth Sciences, dataverses might concentrate on geospatial data, climate models, seismic activity records, and ecological data, supporting efforts to understand and model the Earth's dynamic systems.

These repositories support open science by promoting data sharing across disciplinary boundaries. This feature enables researchers to replicate studies and build upon existing work, which is fundamental for advancing knowledge. They also facilitate interdisciplinary collaboration, allowing experts from different fields to contribute to and draw from a collective data pool. For instance, a dataverse in these fields might include a combination of high-throughput experimental data, field observations, and simulation outputs. The combination of openness and rigorous data management positions dataverses as critical resources in pursuing scientific discovery in Life & Earth sciences.

In life and earth sciences, data lakes are pivotal for consolidating scientific data collected from various biodiversity studies and geological events like earthquakes. Once curated, processed, and analysed, this data contributes significantly to data-driven experiments underpinned by well-established protocols. The harvested data enriches the data lake and supports the creation of detailed, curated views for dissemination through dataverses.

Our vision emphasises the importance of developing and maintaining data lakes with partially curated content in life and earth sciences, facilitating the continuous cycle of experimental data feeding back into the lake and its subsequent sharing via dataverses.

3. Maintaining and sharing earth and life sciences knowledge: challenges

Various data on life and earth sciences have been acquired from different sources [25]. Integrated access to data collections and their curated versions can facilitate their maintenance, analysis and experimentation. It can also demonstrate knowledge of the discipline, with its vocabulary, concepts and relationships, in a synthetic way.

Curation, maintenance and exploration of data collections in the data lake call for techniques that allow these collections to be explored and enriched while producing new data and analytical results. Data curation also means keeping track of the type of experiments carried out on the data, their results and the conditions under which they were carried out. Maintaining a catalogue of data-related questions and experiments can promote open science and the sharing of the data and knowledge the scientific community has gained from it [26]. This information should also be stored in the data lake.
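A minimal sketch of such an experiment catalogue stored inside the lake is given below; the directory layout, record fields and helper function are illustrative assumptions, not a design the paper prescribes.

```python
import json
from datetime import date
from pathlib import Path
from tempfile import TemporaryDirectory

def log_experiment(lake_root, name, inputs, conditions, results):
    """Append a catalogue entry describing an experiment (its inputs,
    the conditions under which it ran, and its results) to the lake."""
    catalogue = Path(lake_root) / "experiments.jsonl"
    entry = {"name": name, "date": date.today().isoformat(),
             "inputs": inputs, "conditions": conditions, "results": results}
    with catalogue.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

with TemporaryDirectory() as lake:
    log_experiment(lake, "aftershock-detection",
                   inputs=["station-7/2024-01.mseed"],
                   conditions={"filter": "bandpass 1-10 Hz", "tool": "v0.3"},
                   results={"events": 12})
    # The catalogue itself lives in the lake and can be queried later.
    lines = (Path(lake) / "experiments.jsonl").read_text().splitlines()
    print(len(lines))  # 1
```

Keeping such records next to the raw data is what makes the conditions of an experiment, and hence its reproducibility, queryable later.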
Challenge 1: How to structure and organise life and earth sciences metadata? Metadata modelling is a way of structuring and organising earthquake and biodiversity data. The metadata model must make the content of a data lake findable, accessible, interoperable and reusable (FAIR principles [27]). Metadata can represent the data's structural, semantic and contextual aspects (provenance, conditions and assumptions under which the analytical results are obtained, i.e., the metadata driving the analysis). Most proposed models are based on logic or structured by graphs [28, 29] and can be specialised for seismic geophysical data and biodiversity. Besides, associating metadata can be achieved by considering quantitative and qualitative perspectives through data curation. Combining quantitative and qualitative approaches allows for a meta-model of the content used and produced in experiments and of the conditions in which the content is produced, chosen, validated and considered representative knowledge for the domain of study.

Challenge 2: How to integrate data in the data lake? Since the experiments require several data collections, integrating the data into the data lake must be part of a pipeline that includes data discovery, exploration, selection and integration. This process should be designed based on the requirements of life and earth science experiments [25]. The heterogeneity of the data (text, signals, multimedia, proprietary formats from seismographs), the speed of data often produced as streams (as with seismic sensors) and, in addition, their volume are aspects that require original contributions to the design, maintenance and exploration of the data lake.

Challenge 3: How to integrate data in the data lake considering scientists' needs? The researcher's intervention, defined as researcher-in-the-loop (RITL) [30], is a crucial form of human intervention to assess content concerning (i) the conditions in which it is pro-

4. Towards a curation approach for building a Life & Earth sciences data lake

Figure 1 illustrates the principle of our vision concerning the way a life and earth sciences data lake can be built, maintained and exploited. Our approach is based on the quantitative and qualitative curation of data harvested digitally and in situ (left-hand side of the figure). Heterogeneous raw data is gathered and stored in the data lake. Then, algorithms (statistical and Artificial Intelligence) and researchers can process, filter and classify data. This filtering process produces and stores metadata in the data lake. Data exploration and integration (cleaning and engineering) processes can be performed on data samples from the data lake. They can be used for experimental purposes to produce content associated with the data stored in the data lake. Clean and curated data, associated with metadata representing the quantitative and qualitative perspective of the experiments, can then be shared in a dataverse (right-hand side of the figure).

Harvested data, models and knowledge integration. Various life and earth sciences data have been harvested from different sources. Since they are heterogeneous and produced at different paces (continuous and in batch), our approach proposes an integration approach based on a pivot meta-representation. The principle is to present a general meta-model of their content and process them for extracting technical, structural and semantic metadata. This abstract representation provides integrated access to data collections and curated versions under a global knowledge graph and can promote their maintenance, analysis, and experimentation. It can also
                                                                      show the knowledge of the discipline with its vocabu-
duced and (ii) to make decisions about the new tasks
                                                                      lary, concepts, and relations in a synthetic manner. The
to perform and the way a research project will move
                                                                      data lake can be pivotal in collecting, processing, and
forward. RITL is a case of Human-in-the-loop (HITL),
                                                                      exporting raw data in a curated view.
where the primary output of the process is a selection
of the data, not a trained machine-learning model. HITL
is crucial for handling supervision, exception control, Curation, maintenance, and exploration of data
optimisation, and maintenance [31, 32]. Under a RITL collections for bringing data value from in situ ob-
approach, a human sees all data points in the relevant servations and experiments. Since data acts as a
selection at the end of the process. Using RITL requires backbone in modelling phenomena for understanding
a systematic solid way of working15 . This characteristic their behaviour, it is critical to developing good collec-
is critical for designing content curation for quantitative tion and maintenance: which are available data collec-
and qualitative research methods.                                     tions? Are they complete? Which is their provenance?
    Scientific content should be extracted and computed, In which conditions were they collected? Have they been
including data, analytics tasks (manual and AI models), processed? In which cases have they been used, and what
and associated metadata. This curated content allows the are the associated results? We propose techniques to ex-
produced knowledge to be reusable and analytics results plore data collections using graphs that can be explored
to be reproducible [33], thereby adhering to the FAIR and enriched while new data and analytics results are
principles [34].                                                      produced. Data curation also means keeping track of the
                                                                      type of experiments run on data, their results, and the
15                                                                    conditions in which they were performed. Maintaining a
   https://hai.stanford.edu/news/humans-loop-design-interactive-ai-systems
Figure 1: General overview of the curation approach for building, maintaining and exploiting a data lake.
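The graph-based catalogue of collections and experiments evoked above can be illustrated with a minimal, standard-library-only sketch; the class and field names (`Catalogue`, `Node`, the `used_in` edge label, the property keys) are hypothetical and only meant to show the kind of provenance question such a graph should answer.

```python
from dataclasses import dataclass, field

# Minimal in-memory catalogue linking data collections to the
# experiments that used them; all names here are illustrative.
@dataclass
class Node:
    kind: str                                  # "collection" or "experiment"
    props: dict = field(default_factory=dict)  # provenance, conditions, ...

class Catalogue:
    def __init__(self):
        self.nodes: dict[str, Node] = {}
        self.edges: list[tuple[str, str, str]] = []  # (src, label, dst)

    def add(self, name: str, kind: str, **props) -> None:
        self.nodes[name] = Node(kind, props)

    def link(self, src: str, label: str, dst: str) -> None:
        self.edges.append((src, label, dst))

    def experiments_using(self, collection: str) -> list[str]:
        # Follow "used_in" edges from a collection to experiments.
        return [dst for src, label, dst in self.edges
                if src == collection and label == "used_in"]

cat = Catalogue()
cat.add("seismic_streams", "collection",
        provenance="station network, northern Brazil", complete=False)
cat.add("quake_classification", "experiment", status="pilot")
cat.link("seismic_streams", "used_in", "quake_classification")

print(cat.experiments_using("seismic_streams"))  # ['quake_classification']
```

In practice such a structure would be persisted in a graph store and enriched as analytics results arrive; the point here is only the type of query (which experiments used a collection, under which conditions) that the curation approach must support.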



Maintaining a catalogue of data-related questions and experiments can promote open science and the sharing of data, together with the knowledge the scientific community has derived from it.

Modelling and simulating experiments to answer questions in life and earth sciences. Answering research questions through data-driven experiments implies:

     • Designing ad hoc experiment artefact models and programming languages enabling the friendly, context-aware and declarative construction of experiments in life and earth sciences.
     • Collecting experiment execution data (raw input data, prepared datasets, the calibration of experiments' tasks and the associated results).

Pilot experiments. The data lake will be tested in real scenarios through collaboration with domain experts in seismology and biodiversity studies in Brazil. The entry point will be two pilot experiments, namely:

     1. the classification of seismic signals collected by stations through different observations, to distinguish "natural" from human-made earthquakes in the northern region of Brazil;
     2. the classification of in situ observations of the "carabela portuguesa"16 and the modelling of its behaviour on the Brazilian coast.

16 The Portuguese man o' war (Physalia physalis) is a monotypic colonial species of siphonophore hydrozoan of the family Physaliidae. It is commonly found in the open ocean in all warm waters of the world, especially in the tropical and subtropical regions of the Pacific and Indian Oceans, as well as in the Atlantic Gulf Stream. Its sting is dangerous and very painful. https://es.wikipedia.org/wiki/Physalia_physalis

In both cases, it is necessary to (i) apply statistical methods to investigate and unveil new patterns in seismic and biodiversity data, answering open problems or leading to new research questions; and (ii) build predictive models to better describe or approximate phenomena, increasing the knowledge about our planet. The conditions in which statistics and prediction are performed, together with the results and their observations, interpretation and validation, are data to be integrated into the data lake.

Discussion. The originality of the work is to address the construction of a data lake that includes:

     1. raw collected data representing life and earth sciences phenomena (streams, batch, multimedia, proprietary formats);
     2. data produced along data-driven experiments adopting data science techniques, including artificial intelligence algorithms (ML-driven data lakes);
     3. contextual data describing the conditions in which data are collected and experiments are designed and enacted. The data lake will provide data curation modules for extracting metadata according to a well-adapted model, as well as modules for exploring data and using them to design new experiments, thereby adopting an open science perspective.
5. Conclusions and future work

Our vision is that it is necessary to address fundamental research topics at the centre of data science, Big Data management and analytics to solve data-driven problems in life and earth sciences.
   The contribution is the design and exploration techniques of a data lake with a well-adapted model for metadata about life and earth sciences experiments consuming and producing quantitative and qualitative data. An important future work will be to define exploration operators and pipelines to exploit this content for maintaining and developing new life and earth sciences experiments.

6. Acknowledgements

The work reported in this paper is done in the context of the LETITIA17 project, funded by the Fédération Informatique de Lyon18.

17 http://vargas-solar.com/letitia/
18 https://fil.cnrs.fr

References

 [1] G. Vargas-Solar, J.-L. Zechinelli-Martini, J. A. Espinosa-Oviedo, L. M. Vilches-Blázquez, Laclichev: Exploring the history of climate change in Latin America within newspapers digital collections, in: New Trends in Database and Information Systems: ADBIS 2021 Short Papers, Doctoral Consortium and Workshops: DOING, SIMPDA, MADEISD, MegaData, CAoNS, Tartu, Estonia, August 24-26, 2021, Proceedings, Springer, 2021, pp. 121–132.
 [2] L. S. do Nascimento, C. S. Hara, M. N. Junior, M. Noernberg, Redes sociais como uma fonte de dados alternativa para monitorar águas-vivas no Brasil, in: Livro de Memórias do IV SUSTENTARE e VII WIPIS: Workshop Internacional de Sustentabilidade, Indicadores e Gestão de Recursos Hídricos (Online) – Even3, Piracicaba, 2022.
 [3] E. N. Sarr, S. Ousmane, A. Diallo, FactExtract: automatic collection and aggregation of articles and journalistic factual claims from online newspaper, in: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), IEEE, 2018, pp. 336–341.
 [4] D. Garkov, C. Müller, M. Braun, D. Weiskopf, F. Schreiber, Research data curation in visualization: Position paper (2023).
 [5] S. Lafia, A. Thomer, D. Bleckley, D. Akmon, L. Hemphill, Leveraging machine learning to detect data curation activities, in: 2021 IEEE 17th International Conference on eScience (eScience), IEEE, 2021, pp. 149–158.
 [6] A. Zuiderwijk, R. Shinde, W. Jeng, What drives and inhibits researchers to share and use open research data? A systematic literature review to analyze factors influencing open research data adoption, PLoS One 15 (2020).
 [7] M. Vuorre, J. P. Curley, Curating research assets: A tutorial on the Git version control system, Advances in Methods and Practices in Psychological Science 1 (2018) 219–236.
 [8] M. Esteva, W. Xu, N. Simone, K. Nagpal, A. Gupta, M. Jah, Synchronic curation for assessing reuse and integration fitness of multiple data collections (2022).
 [9] M. Woods, R. Macklin, G. K. Lewis, Researcher reflexivity: exploring the impacts of CAQDAS use, International Journal of Social Research Methodology 19 (2016) 385–403.
[10] ATLAS.ti, ATLAS.ti, https://atlasti.com, last accessed April 2023.
[11] Dedoose, Dedoose, https://www.dedoose.com/, last accessed April 2023.
[12] V. Software, MAXQDA, http://maxqda.com, last accessed April 2023.
[13] NVivo, NVivo, https://www.qsrinternational.com/, last accessed April 2023.
[14] N. Chen, M. Drouhard, R. Kocielnik, J. Suh, C. Aragon, Using machine learning to support qualitative coding in social science: Shifting the focus to ambiguity, ACM Transactions on Interactive Intelligent Systems 8 (2018) 1–20.
[15] J. C. Evers, Current issues in qualitative data analysis software (QDAS): A user and developer perspective, The Qualitative Report 23 (2018) 61–73.
[16] S. Karcher, D. D. Kirilova, C. Pagé, N. Weber, How data curation enables epistemically responsible reuse of qualitative data, The Qualitative Report 26 (2021) 1996–2010.
[17] QDAMiner, QDA Miner, https://provalisresearch.com/products/qualitative-data-analysis-software/, last accessed April 2023.
[18] Quirkos, Quirkos, https://www.quirkos.com, last accessed April 2023.
[19] Transana, Transana, https://www.transana.com, last accessed April 2023.
[20] C. Giebler, C. Gröger, E. Hoos, H. Schwarz, B. Mitschang, Leveraging the data lake: Current state and challenges, in: Big Data Analytics and Knowledge Discovery: 21st International Conference, DaWaK 2019, Linz, Austria, August 26–29, 2019, Proceedings 21, Springer, 2019, pp. 179–188.
[21] R. Hai, C. Quix, M. Jarke, Data lake concept and systems: a survey, arXiv preprint arXiv:2106.09592 (2021).
[22] P. Russom, Data warehouse modernization, TDWI Best Practices Report (2016).
[23] G. King, An introduction to the dataverse network
     as an infrastructure for data sharing, 2007.
[24] M. Crosas, G. King, J. Honaker, L. Sweeney, Au-
     tomating open science for big data, The ANNALS
     of the American Academy of Political and Social
     Science 659 (2015) 260–273.
[25] U. S. da Costa, J. A. Espinosa-Oviedo, M. A. Mu-
     sicante, G. Vargas-Solar, J.-L. Zechinelli-Martini,
     Using provenance in data analytics for seismology:
     Challenges and directions, in: European Confer-
     ence on Advances in Databases and Information
     Systems, Springer, 2022, pp. 311–322.
[26] A. Adorjan, G. Vargas-Solar, R. Motz, Towards
     a human-in-the-loop curation: A qualitative per-
     spective, in: 2022 IEEE/ACS 19th International
     Conference on Computer Systems and Applications
     (AICCSA), IEEE, 2022, pp. 1–8.
[27] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg,
     G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W.
     Boiten, L. B. da Silva Santos, P. E. Bourne, et al.,
     The fair guiding principles for scientific data man-
     agement and stewardship, Scientific data 3 (2016)
     1–9.
[28] E. Scholly, P. N. Sawadogo, P. Liu, J. A. Espinosa-
     Oviedo, C. Favre, S. Loudcher, J. Darmont, C. Noûs,
     goldmedal: une nouvelle contribution à la mod-
     élisation générique des métadonnées des lacs de
     données, Revue des Nouvelles Technologies de
     l’Information (2021).
[29] A. Diouan, E. Ferey, S. Loudcher, J. Darmont,
     C. Noûs, Métadonnées des lacs de données et
     principes fair, in: 18e journées Business Intelli-
     gence et Big Data (EDA 2022), 2022.
[30] R. Van de Schoot, J. de Bruin, Researcher-in-the-
     loop for systematic reviewing of text databases,
     Zenodo: SciNLP: Natural Language Processing and
     Data Mining for Scientific Text (2020).
[31] I. Rahwan, Society-in-the-loop: programming the
     algorithmic social contract, Ethics and information
     technology 20 (2018) 5–14.
[32] E. Mosqueira-Rey, E. Hernández-Pereira, D. Alonso-
     Ríos, J. Bobes-Bascarán, Á. Fernández-Leal, Human-
     in-the-loop machine learning: A state of the art,
     Artificial Intelligence Review 56 (2023) 3005–3054.
[33] J. Leipzig, D. Nüst, C. T. Hoyt, K. Ram, J. Greenberg,
     The role of metadata in reproducible computational
     research, Patterns 2 (2021) 100322.
[34] P. P. F. Barcelos, T. P. Sales, M. Fumagalli, C. M. Fon-
     seca, I. V. Sousa, E. Romanenko, J. Kritz, G. Guiz-
     zardi, A fair model catalog for ontology-driven
     conceptual modeling research, Conceptual Model-
     ing. ER 73 (2022).