<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Data Analytics for Environmental Monitoring</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Angelica Lo Duca</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rosa Lo Duca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Environmental Monitoring, Artificial Intelligence, Data Analytics</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ARPA Lazio</institution>
          ,
          <addr-line>Representative Office, 00187, Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Informatics and Telematics of the National Research Council</institution>
          ,
          <addr-line>via G. Moruzzi 1, 56124 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Environmental monitoring is essential to intervene promptly in environmental disasters or emergencies. Therefore, in the environmental field, as in other fields, it is essential to create reports quickly. In this article, we describe a possible system that uses Artificial Intelligence (AI) to automatically create environmental reports, always considering the supervision of the environmental expert. The system involves two phases, one for testing and the other for production. During the testing phase, the AI system is calibrated on a subset of the available data, while during the production phase, the AI system is fully operational and works on all the available data. We describe a practical case study using the temperature data provided by ARPA Lazio.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, Artificial Intelligence (AI) has made substantial progress thanks to the diffusion of Large
Language Models (LLMs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These models offer new opportunities for professionals, speed up the
execution of specific tasks, and provide increasingly advanced tools for solving complex tasks
[
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. Applying their potential to environmental data analysis opens a window onto a scenario
of considerable global interest. Institutions, associations, companies in the sector, and citizens need
guidance on how to read environmental data. Institutions are called upon to make targeted decisions
to safeguard the environment. Associations and companies in the sector have the task of calibrating
their products and services based on specific criteria that guarantee sustainable development. With their
daily actions and choices, ordinary citizens can make a difference in safeguarding the entire ecosystem.
Our work fits into this perspective: the objective is to put AI at the service of the aforementioned
categories to help them understand environmental data and make responsible and concrete choices
in the environmental context. In this paper, we propose using LLMs to build environmental reports
automatically. The proposed system relies on a preliminary phase, called the testing phase, which
calibrates the AI model with the support of an environmental expert. In the second phase, the production
phase, the AI system automatically answers the environmental expert’s questions about the data. We
also describe a preliminary case study applied to the historical series of temperatures provided by
ARPA Lazio [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Cao et al. have proposed a similar framework for analyzing environmental data
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Unlike their work, our system calibrates the AI model with the help of a human
expert.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Our Approach</title>
      <p>[Figure 1: workflow of the proposed system. Labels recovered from the diagram: Environmental Data; Testing Phase (Instructions, Manual Data Analytics, AI Automatic Data Analytics, Manual Evaluation); Production Phase (Final Instructions, AI Automatic Data Analytics, Manual Control, Report).]</p>
      <p>
        We propose a system combining manual supervision with automatic analytics. There are two actors:
the environmental expert, a human with deep knowledge of environmental data, and the AI model,
based on LLMs, which is tuned with environmental data through a Retrieval Augmented Generation
(RAG) process [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. At the moment, the AI model ingests only the environmental data, but in the
future, it could also ingest other environmental documents and reports, helping it better understand
the environmental context in which it must operate. Our system comprises two phases, as shown
in Figure 1: the testing and production phases. During testing, the AI system is calibrated so that
the input instructions are defined correctly. This involves several steps. The workflow starts from the upper
right of the figure and follows the flow defined by the arrows. Given the database containing all the
environmental data, the environmental expert carefully extracts a data subset representative of all
the data. Subsequently, the environmental expert defines a series of questions to ask of the data. These
questions are translated into instructions for the AI model. The AI model and the
environmental expert then analyze the data sample separately. Both answer the questions
posed by the environmental expert: the expert manually, through standard data analysis techniques, and the AI
model through automatic analysis. At the end of the analysis, the environmental expert compares
the results obtained by the AI with their own results. If the environmental expert detects problems in the
answers provided by the AI, they reformulate the instructions for the AI
model. This process continues until the AI correctly answers the questions asked. Once the evaluation
is complete, the model is calibrated and we can move on to the production phase, where only the
AI model is used to analyze the entire database. At the end of the analysis process, the AI
model generates a report with all the answers to the questions. The environmental expert evaluates the
final results. We implemented the AI model as a RAG-based system using LangChain [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], ChromaDB
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and OpenAI GPT-4 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
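      <p>The testing phase described above can be read as a refinement loop. The following Python sketch is one possible rendering of that loop, not the implementation used in this work: the names calibrate, ask_model, and revise, the max_rounds limit, and the exact-match comparison are all hypothetical placeholders for the AI model call, the expert's reformulation of the instructions, and the expert's manual evaluation.</p>
      <preformat>
```python
from typing import Callable, Dict, List, Tuple


def calibrate(
    instructions: str,
    questions: List[str],
    expert_answers: Dict[str, str],
    ask_model: Callable[[str, str], str],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 5,
) -> Tuple[str, bool]:
    """Refine the instructions until the model matches the expert.

    All arguments are hypothetical stand-ins: ask_model(instructions, question)
    wraps the AI model, and revise(instructions, failed_questions) represents
    the expert reformulating the instructions.
    """
    for _ in range(max_rounds):
        # Questions whose AI answer differs from the expert's manual answer.
        # Exact string matching is a simplification of the expert's judgment.
        failed = [
            q for q in questions
            if ask_model(instructions, q).strip() != expert_answers[q].strip()
        ]
        if not failed:
            return instructions, True  # calibrated: ready for production
        instructions = revise(instructions, failed)
    return instructions, False  # still failing after max_rounds


if __name__ == "__main__":
    # Toy stand-ins with made-up answers: the "model" only answers correctly
    # once the instructions mention the full range of years.
    truth = {"Which year was hottest?": "2022"}

    def toy_model(instructions: str, question: str) -> str:
        return "2022" if "2013 to 2023" in instructions else "2015"

    def toy_revise(instructions: str, failed: List[str]) -> str:
        return instructions + " Consider all the data from 2013 to 2023."

    final, ok = calibrate(
        "Consider the data.", list(truth), truth, toy_model, toy_revise
    )
    print(ok, final)
```
      </preformat>
      <p>Returning the final instructions together with a success flag keeps the decision to move to production in the expert's hands when the loop stops without agreement.</p>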
    </sec>
    <sec id="sec-3">
      <title>3. Case Study</title>
      <p>
        To test the effectiveness of our proposed approach, we analyzed the historical series of temperature
data recorded in the decade 2013-2023 by the AL007 station, which belongs to the Micro Meteorological Network
(RMM) managed by ARPA Lazio and is located in via Boncompagni 101 in Rome [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The RMM is made
up of 9 stations located throughout the Lazio region, each equipped with classic meteorological sensors
(temperature, humidity, pressure, and precipitation) together with instrumentation dedicated to the
dispersion of pollutants (sonic anemometers, pyranometers, and pyrgeometers). All instruments are digital.
In particular, an HMP 45 AC thermo-hygrometer measures air temperature and
relative humidity. Readings are taken every 30 minutes. Of
these data, we have focused only on temperature.
      </p>
      <p>Through traditional data pre-processing techniques, the environmental expert (one of the authors)
extracted the average, maximum, minimum, and mode of the temperature values for each month
of each year considered. Next, they formulated the following questions, with an increasing level of
complexity: 1) In which month and year was the highest temperature value recorded? 2) What was
the hottest month for each year? 3) What is the percentage increase in temperature over the decade
2013-2023? We transformed these questions into instructions for the AI model. We defined the
following basic structure for all the questions:
<preformat>Consider {context} where:
- Year is the year
- Month is the month
- Mean is the average monthly temperature
- Max is the maximum monthly temperature
- Min is the minimum monthly temperature
- Mode is the most frequent monthly temperature value
Answer the following question: {question}</preformat></p>
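      <p>The structure above has two placeholders. A minimal sketch of how it could be filled with Python's built-in str.format follows; the CSV rendering of the context, the helper name build_prompt, and the sample row are illustrative assumptions, not the data or code used in the case study.</p>
      <preformat>
```python
PROMPT_TEMPLATE = """Consider {context} where:
- Year is the year
- Month is the month
- Mean is the average monthly temperature
- Max is the maximum monthly temperature
- Min is the minimum monthly temperature
- Mode is the most frequent monthly temperature value
Answer the following question: {question}"""


def build_prompt(rows: list, question: str) -> str:
    """Fill the template; rows are dicts keyed by the column names above."""
    columns = ["Year", "Month", "Mean", "Max", "Min", "Mode"]
    # Assumed serialization: one header line, then one CSV line per month.
    lines = [",".join(columns)]
    for row in rows:
        lines.append(",".join(str(row[c]) for c in columns))
    return PROMPT_TEMPLATE.format(context="\n".join(lines), question=question)


if __name__ == "__main__":
    # Made-up values, for illustration only.
    sample = [
        {"Year": 2013, "Month": 7, "Mean": 26.1, "Max": 35.0, "Min": 18.2, "Mode": 25.0}
    ]
    print(build_prompt(sample, "What was the hottest month for each year?"))
```
      </preformat>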
      <p>
        The context variable contains the input data, and the question variable is the specific question to be
answered.
      </p>
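      <p>The input data placed in the context variable are the monthly statistics extracted by the environmental expert. As a rough sketch of that pre-processing step, the following standard-library Python groups timestamped readings by year and month and computes the six columns named in the template. The function name monthly_stats, the (timestamp, temperature) input format, the rounding to one decimal before taking the mode, and the tie-break are assumptions; the source does not specify them.</p>
      <preformat>
```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, multimode


def monthly_stats(readings):
    """Aggregate (timestamp, temperature) pairs into monthly rows.

    Returns dicts with the keys Year, Month, Mean, Max, Min, Mode,
    sorted by year and month.
    """
    groups = defaultdict(list)
    for timestamp, temperature in readings:
        groups[(timestamp.year, timestamp.month)].append(temperature)

    rows = []
    for (year, month), values in sorted(groups.items()):
        rounded = [round(v, 1) for v in values]
        rows.append({
            "Year": year,
            "Month": month,
            "Mean": round(mean(values), 2),
            "Max": max(values),
            "Min": min(values),
            # multimode returns every most-frequent value; on ties we keep
            # the smallest one (an arbitrary, assumed tie-break).
            "Mode": min(multimode(rounded)),
        })
    return rows


if __name__ == "__main__":
    # Three made-up half-hourly readings, for illustration only.
    demo = [
        (datetime(2013, 1, 1, 0, 0), 8.0),
        (datetime(2013, 1, 1, 0, 30), 8.0),
        (datetime(2013, 1, 1, 1, 0), 9.5),
    ]
    print(monthly_stats(demo))
```
      </preformat>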
      <p>In the first round of testing, the AI model read and extracted data from only four years (2013, 2014,
2015, and 2020) out of the entire available sample (2013-2023). Measured against the entire data sample,
the AI model answered the first two questions incorrectly. However, when the comparison is restricted to those four years,
the answers are correct. Regarding the last question, the AI model did not provide a
numerical value, but it offered an interesting result: it precisely described the steps needed to answer the
question, adding that it could not apply this procedure because it lacked a key piece of data, that of 2023,
which was not among the years known to it.</p>
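      <p>The third question needs values from both ends of the period, which is why the 2023 data matter. One plausible reading of the question, assumed here because the source does not give a formula, is the relative change between the annual mean temperatures of the first and last years: 100 * (T_2023 - T_2013) / T_2013. The Python sketch below applies that assumed definition to monthly rows; the function names are hypothetical, and it raises an error when either end year is missing, mirroring the situation the model reported.</p>
      <preformat>
```python
from statistics import mean


def annual_mean(rows, year):
    """Average the monthly Mean values of one year; fail if the year is absent.

    Averaging monthly means ignores differing month lengths (a simplification).
    """
    values = [row["Mean"] for row in rows if row["Year"] == year]
    if not values:
        raise ValueError(f"no data for year {year}")
    return mean(values)


def percentage_increase(rows, first_year=2013, last_year=2023):
    """Relative change of the annual mean between two years, in percent."""
    start = annual_mean(rows, first_year)
    end = annual_mean(rows, last_year)
    return 100.0 * (end - start) / start


if __name__ == "__main__":
    # Made-up values, for illustration only.
    demo = [
        {"Year": 2013, "Month": 1, "Mean": 10.0},
        {"Year": 2013, "Month": 7, "Mean": 20.0},
        {"Year": 2023, "Month": 1, "Mean": 12.0},
        {"Year": 2023, "Month": 7, "Mean": 21.0},
    ]
    # Annual means are 15.0 and 16.5, so this prints 10.0.
    print(percentage_increase(demo))
```
      </preformat>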
      <p>As a second step, we modified the structure of the questions to force the model to read all the years.
We defined the following structure for all the questions:
<preformat>I have many datasets from 2013 to 2023.
Consider {context} where:
- Year is the year
- Month is the month
- Mean is the average monthly temperature
- Max is the maximum monthly temperature
- Min is the minimum monthly temperature
- Mode is the most frequent monthly temperature value
Answer the following question considering all the data from 2013 to 2023: {question}</preformat></p>
      <p>Compared to the previous version of the questions, we added more details about the dataset
range (the opening sentence and the final clause of the structure above). The AI model answered the third question correctly, while it
could not completely answer the first and second questions because it did not find data for 2021,
2022, and 2023. The refinement process of the testing phase should continue until the AI model
can answer all the questions correctly. Although the
AI model could not yet answer all the questions correctly, these preliminary results are
encouraging when compared with the answers we obtained manually. As future work, we plan to improve
the testing phase and to implement the production phase of the project.</p>
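      <p>Since the incomplete answers were traced to years missing from what the model found, a simple check on the context could flag the problem before a question is asked. The sketch below is an illustration under stated assumptions, not part of the described system: missing_years is a hypothetical helper that compares the years present in the context rows against the expected range.</p>
      <preformat>
```python
def missing_years(rows, first_year=2013, last_year=2023):
    """Return the expected years that have no row in the context."""
    present = {row["Year"] for row in rows}
    return [y for y in range(first_year, last_year + 1) if y not in present]


if __name__ == "__main__":
    # Made-up context covering 2013-2020 only.
    context_rows = [{"Year": y, "Month": 1} for y in range(2013, 2021)]
    print(missing_years(context_rows))  # [2021, 2022, 2023]
```
      </preformat>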
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Sani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sroká</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Burattin</surname>
          </string-name>
          ,
          <article-title>Llms and process mining: Challenges in rpa: Task grouping, labelling and connector recommendation</article-title>
          ,
          <source>in: International Conference on Process Mining</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>379</fpage>
          -
          <lpage>391</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Gebreab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Salah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jayaraman</surname>
          </string-name>
          ,
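          <string-name>
            <given-names>M. H.</given-names>
            <surname>ur Rehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ellaham</surname>
          </string-name>
          ,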
          <article-title>Llm-based framework for administrative task automation in healthcare</article-title>
          ,
          <source>in: 2024 12th International Symposium on Digital Forensics and Security (ISDFS)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Olaoye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jonathan</surname>
          </string-name>
          ,
          <article-title>The evolving role of large language models (llms) in banking</article-title>
          ,
          <source>EasyChair Preprint no. 13367</source>
          ,
          <publisher-name>EasyChair</publisher-name>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <collab>ARPA Lazio</collab>
          , Dati rete micro-meteorologica, 2013-2023,
          <source>Rete micro-meteorologica - ARPA Lazio</source>
          ,
          <year>2023</year>
          . https://www.arpalazio.it/rete-micro-meteorologica (Last Access 2024/07/15).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Llm-assisted modeling and simulations for public sector decision-making: Bridging climate data and policy insights</article-title>
          ,
          <source>in: AAAI-2024 Workshop on Public Sector LLMs: Algorithmic and Sociotechnical Design</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          , in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          , Curran Associates, Inc.,
          <year>2020</year>
          , pp.
          <fpage>9459</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] LangChain, https://langchain.com/,
          <year>2024</year>
          . (Last Access 2024/07/15).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] Chroma core/chroma,
          <article-title>The AI-native open-source embedding database</article-title>
          , https://github.com/chroma-core/chroma,
          <year>2024</year>
          . (Last Access 2024/07/15).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] OpenAI, GPT-4
          <source>Technical Report</source>
          , arXiv.org,
          <year>2023</year>
          . https://doi.org/10.48550/arXiv.2303.08774.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>