<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ECDP: A Big Data Platform for the Smart Monitoring of Local Energy Communities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Gagliardelli</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Zecchini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Beneventano</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Simonini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonia Bergamaschi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mirko Orsini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Magnotta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emma Mescoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Livaldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Gessa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Piero De Sabbata</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca D'Agosta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Paolucci</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Moretti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DataRiver S.r.l.</institution>
          ,
          <addr-line>Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA)</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Modena and Reggio Emilia</institution>
          ,
          <addr-line>Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present the Energy Community Data Platform (ECDP), a middleware platform designed to support the collection and the analysis of big data about the energy consumption inside local energy communities, with the aim of encouraging a more conscious use of energy by the users. The big data platform, commissioned by ENEA, acquires data of different nature (e.g., describing the measurement of the energy consumption and production, weather conditions, etc.) in a heterogeneous format from multiple sources. We describe the architecture of ECDP, designed to support a Data Integration Workflow and a Data Lake Workflow, conceived for different uses of the data, motivating our technological choices. Then, we illustrate several dataflows reflecting real-world use cases, which highlight the advantages offered by the designed architecture for different types of users. The main strengths of the presented big data platform are flexibility and scalability (guaranteed by its modular architecture), which allow its applicability to any type of local energy community.</p>
      </abstract>
      <kwd-group>
        <kwd>Big Data Integration</kwd>
        <kwd>Energy Communities</kwd>
        <kwd>Big Data Platform</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The Energy Community Data Platform (ECDP), commissioned by ENEA (the Italian National Agency for New Technologies, Energy and Sustainable Economic Development), is a middleware platform we designed to collect and analyze big data about the energy consumption inside Local Energy Communities (LEC), with the aim of encouraging a conscious use of energy by the users, at home and in the workplace.</p>
      <p>ECDP is designed for the acquisition of data from different dataflows in a heterogeneous format, the proper management of the workflows to retrieve and store the great amount of data acquired from different sources and utilities (of a public or private nature), and functionalities of data integration, transformation, and cleaning, required to make the data ready to run queries on it and for its use in data analysis and visualization operations.</p>
      <p>The design of the big data platform was driven by several real-world use cases, which allowed us to detect two main requirements to be satisfied: (i) it must allow to preprocess the data in a batch and periodical way, then to store it, making it available for data analysis without having to repeat these operations every time; (ii) it must integrate heterogeneous data acquired from multiple sources with different characteristics. To comply with the first requirement, ECDP exploits an efficient data lake storage layer, namely Delta Lake [1]. Storing the data in Delta Lake allows to mitigate the excessive execution time needed to import large files or to import data in bulk from database management systems. For the second requirement, the architecture of the big data platform is centered on MOMIS (Mediator EnvirOnment for Multiple Information Sources), an open-source data integration system [2, 3, 4] which adopts a semantic approach to the integration of different data sources, also making the acquisition of new sources easier. ECDP was developed following the wrapper/mediator architecture of MOMIS, which allows to aggregate information from heterogeneous data sources (both structured and semi-structured) and make it homogeneous, in a semi-automatic way. Thus, it makes it possible to obtain a single unified source, without any redundancy or conflict in the data.</p>
      <p>In Section 2, we describe the architecture of the big data platform, motivating the structural choices and the adopted technologies. The main strengths of the proposed solution are flexibility and scalability. ECDP was designed in a modular manner as a flexible and open system, to ease future extensions for the implementation of new functionalities, different ways for connecting to and retrieving data from devices or services, and the integration with the other components of the software architecture or software systems present inside the LEC. These features allow an extremely wide range of applications for ECDP, making it suitable for any type of local energy community.</p>
      <p>Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29–April 1, 2022, Edinburgh, UK). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).</p>
      <p>ORCID: 0000-0001-5977-1078 (L. Gagliardelli); 0000-0002-4856-0838 (L. Zecchini); 0000-0001-6616-1753 (D. Beneventano); 0000-0002-3466-509X (G. Simonini); 0000-0001-8087-6587 (S. Bergamaschi); 0000-0002-5087-9530 (M. Orsini)</p>
      <p>The flexibility of ECDP is highlighted in Section 3, where we illustrate its adaptability to several real-world use cases that drove the design process, addressing different application areas. These use cases are related to projects carried out by ENEA, namely SelfUser, aiming at maximizing the collective self-consumption based on renewable energy in a condominium (Section 3.2), and PELL, to optimize the energy consumption of public lighting (Section 3.3). Moreover, also the European project GECO about green energy communities was considered as a reference for the design of the architecture.</p>
      <sec id="sec-1-1">
        <title>Contributions by the groups</title>
        <p>The presented big data platform was designed and developed within the project “Smart Monitoring of a Local Energy Community” (2020-2021), supervised by ENEA. The design of the platform was carried out by the Database Research Group (DBGroup, https://dbgroup.unimore.it) of the University of Modena and Reggio Emilia, leveraging its experience in big data integration and analysis [5, 6, 7, 8, 9], jointly with DataRiver (https://www.datariver.it/en), which subsequently took care of its implementation. Both phases complied with the specifications and the technical documentation provided by ENEA and saw a continuous interaction with its project team.</p>
        <sec id="sec-1-1-1">
          <title>Footnotes</title>
          <p>1 https://www.enea.it; 2 http://www.selfuser.it; 3 https://www.pell.enea.it; 4 https://www.gecocommunity.it</p>
        </sec>
        <p>This section illustrates the architecture of the Energy Community Data Platform (ECDP), represented in Figure 1, whose main characteristics and requirements were presented in Section 1.</p>
        <sec id="sec-1-1-2">
          <title>2.1. Data Sources</title>
          <p>The three modules identified as "Data Sources" in Figure 1 reflect the existing scenario and the solutions adopted by ENEA, and constitute the basis on which to build the big data platform.</p>
          <p>Data sources are multiple and heterogeneous, and will certainly increase in number and change over time. The data about the functioning of the LEC can be static or dynamic. Static data is collected once and stored in tables of a relational DBMS; it can be related to people (e.g., personal information), buildings (e.g., geopositioning or exterior insulation), or apartments/offices (e.g., air conditioning). Dynamic data, which is collected by sensors (continuously or at different intervals) or received from outside the LEC (e.g., from energy providers), is mainly related to the energy consumption, production, or accumulation. In the considered use cases, for example, ECDP has to deal with data about the energy consumption/production in condominiums and single apartments, measured with a granularity of one second and sent weekly in CSV format as ZIP archives (i.e., condominium data) or measured with a granularity of 15 minutes and sent continuously as CSV files (i.e., smart meter data), but also meteorological data collected by weather stations with a granularity of one minute and stored as JSON files, and many other types of data, such as information about the presence of the personnel in the offices (in TXT format).</p>
          <p>Furthermore, the considered projects also present some specific additional data. SelfUser adds to the described scenario structured data stored in a PostgreSQL DBMS, providing information about the energy consumption, the energy production by photovoltaic panels, and weather conditions. This structured data presents a granularity of 15 minutes and is obtained by preprocessing some of the raw data described above. The PELL platform deals with static data (e.g., the position of the lighting devices), stored using a MySQL DBMS, and dynamic data (e.g., information about the energy consumption, measured by smart meters), stored in HDFS [10] according to the UrbanDataset data structure (https://smartcityplatform.enea.it/UDWebLibrary).</p>
          <p>The ingestion of the described input data can be managed in different ways. The main solution adopted by ENEA is based on the open-source system integration framework Apache Camel (https://camel.apache.org), running in a deeply customized OSGi Apache Karaf (https://karaf.apache.org) container, named SignalMix. In this environment, both specific custom components and Camel allow to use domain-specific languages (namely, a Java DSL and a Spring XML DSL) to define routes, containing the flow and logic of integration between different systems, protocols, and formats, supporting most of the enterprise integration patterns. However, also different solutions are adopted depending on the source (e.g., some JSON/CSV files are sent directly by the users via e-mail).</p>
          <p>ENEA exploits two web servers to store the ingested data. The central element is the Data Hall server, which supports the so-called "Triage Area", which is the reference for the storage of unstructured data and contains the files (whose format can be CSV, JSON, etc.) as they are acquired from the sources by data ingestion systems, without applying further preprocessing operations on them. From the Data Hall server it is possible to access the second server, called Data Collector, via remote port forwarding. This server supports a PostgreSQL (https://www.postgresql.org) DBMS, which represents the reference for structured data instead (e.g., the data related to SelfUser). Moreover, it is possible to exploit additional database management systems for testing, which are available on other local network servers.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>2.2. Data Workflow</title>
          <p>The module on the right in Figure 1 represents ECDP, the big data platform designed to extract value from this great amount of ingested data. To ensure maximum flexibility, allowing different types of users to operate on data in different ways, it is possible to define two specific workflows (with the related functional modules): (i) a Data Integration Workflow, designed to perform data integration, which provides the main access point to the unified data for every application; (ii) a Data Lake Workflow, which allows to store the whole historical data in a raw (i.e., not integrated) form, available to be handled by expert users if needed. Both systems are coordinated through a Data Workflow Management System, used to define the settings for the operations needed to manage the data workflow (e.g., new input data activating triggers/ETL, failure alerts, error handling). These three modules are detailed in the following subsections.</p>
          <p>2.2.1. Data Integration Workflow: MOMIS. The chosen data integration tool is MOMIS, and it was a natural choice, since this open-source system, currently managed by DataRiver, was designed by the DBGroup and played a central role in its research activities for several years [11, 12].</p>
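          <p>The split between the two workflows can be sketched in a few lines of Python. This is an illustrative sketch, not actual ECDP code, and the record fields ("meter", "ts", "kwh") are invented for the example: the same ingested batch is appended untouched to the data lake, while a cleaned and typed version lands in the relational store.</p>

```python
# Illustrative sketch (not actual ECDP code): the same ingested batch feeds
# both workflows. Record fields ("meter", "ts", "kwh") are invented here.

RAW = [
    {"meter": "m1", "ts": "1640995200", "kwh": " 0.42 "},
    {"meter": "m1", "ts": "1640995260", "kwh": None},  # missing reading
]

def data_lake_workflow(records, lake):
    """Data Lake Workflow: keep the whole history in raw, not-integrated form."""
    lake.extend(records)

def data_integration_workflow(records, db):
    """Data Integration Workflow: clean and normalize before storing."""
    for r in records:
        if r["kwh"] is None:  # simplified cleaning step: drop missing values
            continue
        db.append({"meter": r["meter"], "ts": int(r["ts"]), "kwh": float(r["kwh"])})

lake, db = [], []
data_lake_workflow(RAW, lake)
data_integration_workflow(RAW, db)
# the lake keeps both raw records; the database keeps the one clean, typed record
```

          <p>Keeping both paths fed from the same ingestion step is what later lets advanced users go back to the untouched records when the integrated view is too coarse.</p>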
          <p>MOMIS is based on a wrapper/mediator architecture: the wrapper makes a data source available to be integrated, then the mediator performs data fusion [13] to generate in a semi-automatic way a mediated schema, called Global Virtual View (GVV) or Global Schema (GS), of the schemas of the local sources. The user can query this schema (through the query manager or through third-party applications) to obtain a complete and unified view of the data contained in the local sources. MOMIS is a virtual data integration system, i.e., the data is retrieved from the sources at query time. This allows to avoid data replication, so that each query returns updated data.</p>
          <p>However, it is possible to materialize some global classes
(materialized views); this can be useful to optimize the
retrieval of frequently used data or complex queries.</p>
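          <p>The wrapper/mediator idea can be illustrated with a minimal sketch. The class and field names below are ours, not the MOMIS API: each wrapper maps a local schema onto the global schema, and the mediator pulls from the sources at query time, so the virtual view is always fresh; running an unrestricted query plays the role of materializing a global class.</p>

```python
# Minimal wrapper/mediator sketch; names are ours, not the MOMIS API.

class Wrapper:
    def __init__(self, rows, mapping):
        self.rows = rows
        self.mapping = mapping  # global attribute -> local attribute

    def fetch(self):
        # rename local attributes to the global (mediated) schema
        return [{g: row[l] for g, l in self.mapping.items()} for row in self.rows]

class Mediator:
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, predicate):
        # data is retrieved from the sources now, so each query sees fresh data
        return [r for w in self.wrappers for r in w.fetch() if predicate(r)]

csv_src = Wrapper([{"id": "a1", "cons_kwh": 1.5}], {"meter": "id", "kwh": "cons_kwh"})
pg_src = Wrapper([{"meter_id": "b2", "energy": 0.7}], {"meter": "meter_id", "kwh": "energy"})
gvv = Mediator([csv_src, pg_src])           # plays the Global Virtual View
high = gvv.query(lambda r: r["kwh"] > 1.0)  # unified view across both sources
snapshot = gvv.query(lambda r: True)        # "materializing" a global class
```

          <p>The design choice the sketch mirrors is that schema heterogeneity is absorbed at the wrapper boundary, so adding a new source only requires a new mapping, not changes to the queries.</p>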
          <p>In particular, ECDP is based on MOMIS I4.0, the specific industrial IoT extension of MOMIS for Industry 4.0 [4], a web and mobile application designed to effectively collect and manage big data generated by machinery and sensor networks in industrial processes, exploiting artificial intelligence and machine learning techniques. This platform provides tools for the continuous monitoring and advanced services for the real-time analysis of production and quality performance, which allows to learn from experience for predictive maintenance policies, production process optimization, and energy consumption reduction. The technology stack of MOMIS I4.0 is illustrated in Figure 2 (MOMIS Industrial IoT Technology Stack).</p>
          <p>Data ingestion is performed by software modules called wrappers, which allow to connect to data sources, extracting their schemas and features. The interfaces created through the analysis of these schemas allow to represent the heterogeneous sources in a common language, making them homogeneous through data integration services. Several wrappers were created for ECDP to obtain the representation of data from different dataflows: (i) a version of the CSV wrapper for detecting the ZIP archives about condominium data on the Data Hall server and extracting their content; (ii) a version of the CSV wrapper for detecting the files about smart meter data on the Data Hall server; (iii) a version of the JSON wrapper for detecting the files about meteorological data on the Data Hall server; (iv) the PostgreSQL wrapper for connecting to the DBMS on the Data Collector server to ingest SelfUser data; (v) the Delta Lake wrapper to ingest PELL data from the data lake. Moreover, an MQTT, an OPC UA, and a WoT wrapper, not used in the final version, were implemented and initially considered to directly ingest data from smart meters.</p>
          <p>The software module for data integration exploits the MOMIS mediator to obtain the semantic representation of data sources, allowing to create an ontology of the domain, called Renewable Energy Community (REC) Ontology. The semantic integration is used to perform the mapping between data sources and the mediated schema. Novel services to support data integration were developed for ECDP, allowing to intervene on dataflows with transformation operations such as applying mathematical operators, computing average/minimum/maximum values, performing time conversion to different formats (e.g., UNIX format) and time-based missing value imputation, and adding or removing tuples according to the data timestamp.</p>
          <p>Data querying and export services are used to allow authorized users and applications to access a certain portion of data or an aggregate view. The access is managed through a role-based access control to guarantee to each type of user the right to access the needed information and, at the same time, the security of the information for which the authorization was not granted. ECDP relies on two distinct APIs: (i) ExportUD, to query the integrated table of readings and export the values in UrbanDataset-XML, UrbanDataset-JSON, or raw CSV format; (ii) SQL_API, to run the user queries based on the queries defined in the application and the related access authorizations.</p>
          <p>ECDP also supports a customized script scheduling service, which can be used to run scripts on the MOMIS server. This service simply runs the main class of the script (in Bash or MATLAB language) and is extremely flexible, since it can use the GUI to collect different scripts and to define the configuration for their scheduling.</p>
          <p>Finally, the source management and storage services allow to monitor the data collected and processed by the other software modules. In particular, the source management dashboard was realized using the MOMIS data analytics module (i.e., MOMIS Dashboard). It supports a unified view of the business data integrated with external data sources, searching and monitoring aggregate data from distributed and heterogeneous data sources, visualizing indicators on charts and dynamic tables, and managing security and visibility of data based on roles and user groups.</p>
          <p>ECDP exploits a hybrid storage, based on two solutions for different uses: (i) the Delta Lake storage layer, used to store big data for offline data historicization and analysis (if the data timestamp has passed the retention time, the data is historicized in Delta Lake and removed from PostgreSQL); (ii) a PostgreSQL relational DBMS, optimized through the PostGIS and TimescaleDB extensions for querying temporal series, used to answer the most frequent queries on recent data, and containing a synthesis of the data historicized in Delta Lake.</p>
        </sec>
        <sec id="sec-1-1-4">
          <title>2.2.2. Data Lake Workflow: Delta Lake</title>
          <p>The estimated size of data and the available resources allow to store raw data (i.e., data as it is acquired, without applying any preprocessing operation on it) in a data lake which lays on HDFS. The main goal of this workflow is to support data analysis performed by advanced users who need to access raw data to retrieve additional information that cannot be obtained from integrated data. For example, raw sensor data collected with a granularity of one second is stored in the data lake, while in the Data Integration Workflow it is aggregated according to a granularity of 15 minutes. Thus, if an advanced user needs data at the finest granularity, it is necessary to exploit the Data Lake Workflow.</p>
          <p>The chosen technology for the data lake is Delta Lake [1], an open-source project which allows to manage a great amount of data using existing storage tools such as HDFS. Delta Lake is integrated with the whole Apache Spark (https://spark.apache.org) ecosystem, allowing the native use of all its libraries (e.g., MLlib for machine learning) and the execution of efficient elaborations on data both in batch and in streaming mode. The possibility of exploiting this powerful engine to operate on raw data was one of the core reasons for the choice of Delta Lake, also considering the significant role played by Spark in the DBGroup research activity [14, 15, 16]. Delta Lake is a mature tool, used by multiple cloud service providers such as Databricks and Microsoft Azure, which supports ACID transactions, guaranteeing the highest level of isolation (this allows to execute read and write operations at the same time without issues about data integrity). Delta Lake exploits Apache Parquet (https://parquet.apache.org), an open-source format which performs data compression, requiring much less space for data storage than the original format (usually CSV). Moreover, it supports data versioning, tracking the operations executed on data through a system of logs and allowing to execute rollbacks; it allows to modify the data schema, and it handles metadata as if it were data.</p>
        </sec>
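        <p>The granularity gap between the two workflows can be sketched as follows. This is a toy example of the aggregation step (the mean value per 15-minute bucket), not the platform's actual transformation service: one-second raw readings stay in the data lake, while the integrated view only keeps the coarse buckets.</p>

```python
# Toy sketch of the 1-second -> 15-minute aggregation performed in the
# Data Integration Workflow (the raw readings remain in the data lake).

def aggregate(readings, bucket_s=15 * 60):
    """readings: iterable of (unix_ts, value) pairs -> {bucket_start: mean}."""
    buckets = {}
    for ts, value in readings:
        start = ts - ts % bucket_s  # align the timestamp to its bucket start
        buckets.setdefault(start, []).append(value)
    return {start: sum(vs) / len(vs) for start, vs in buckets.items()}

raw = [(0, 1.0), (1, 3.0), (900, 5.0)]  # three one-second readings
print(aggregate(raw))                   # {0: 2.0, 900: 5.0}
```

        <p>An advanced user who needs the two readings inside bucket 0 separately has to go back to the lake; the integrated store only ever sees their mean.</p>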
      </sec>
      <sec id="sec-1-3">
        <title>2.2.3. Data Workflow Management System: Apache Airflow</title>
        <p>The goal of the Data Workflow Management System is the orchestration of the different functional modules. In particular, when new data is made available from the sources, it has to provide the mechanisms to automatically identify it and to activate the appropriate workflows. Furthermore, it has to support the handling of unexpected errors and it must allow to compose workflows by combining different software modules. The chosen tool has to be compatible with these modules, programmable, and simple to use. Considering these requirements, we chose Apache Airflow (https://airflow.apache.org), which allows to: (i) manage task workflows through simple Python scripts defining the dependencies among tasks, whose execution order can be represented with a directed acyclic graph (each module is a black box, so it can be executed by any tool and implemented with any programming language); (ii) monitor the workflow (even through a GUI) and manage the possible errors generated by tasks, using scripts to define the trigger rules.</p>
        <sec id="sec-1-3-1">
          <title>3. Use Cases and Scenarios</title>
          <p>In this section we report three dataflows related to real-world use cases that show how ECDP can be used in different application scenarios. In particular, we want to illustrate: (i) how unstructured data is ingested (Section 3.1); (ii) how structured data is managed (Section 3.2); (iii) how it is possible to use Spark to directly import data in the platform (Section 3.3).</p>
          <p>The chosen dataflows let us highlight the advantages and the flexibility offered by the coexistence of the two distinct workflows, using a data lake to store the entire raw data (i.e., Data Lake Workflow) in combination with a structured database to store the integrated data (i.e., Data Integration Workflow). In fact, this solution allows the standard users to access an integrated and clean view of the data (e.g., they can see in a dashboard the hourly energy production by the photovoltaic panels related to the weather conditions), while advanced users can use the raw data to perform advanced tasks (e.g., a data scientist can use the data about the energy production by the photovoltaic panels collected at the finest granularity, combined with the solar radiation collected by the sensors, to train a machine learning model to predict the energy production).</p>
          <p>3.1. Data Hall Dataflow. Data Hall is the web server managed by ENEA on which the so-called "Triage Area" (see Figure 1) is located; it hosts unstructured raw data (e.g., JSON and CSV files). The web server mainly contains data of the SelfUser project, which aims to test the Clean Energy Package directive by creating an innovative plant, in a pilot form, to support the energy transition through the promotion of energy produced by renewable sources (in this case, photovoltaics). The data, such as the energy consumption, the energy production by photovoltaic panels, and weather conditions, is collected from several sensors placed in smart buildings. The data can be considered as time series with different time granularities: the energy consumption/production is measured every second, while the weather information every minute.</p>
          <p>The dataflow is described in Figure 3. Firstly, MOMIS (Figure 3-1) acquires the new data from the Data Hall server by using its wrappers and stores it into Delta Lake, where it can be used for further analysis by directly querying the data lake with the connectors provided by Spark (Figure 3-4). Then, the raw data is processed by MOMIS to clean and integrate it. The resulting integrated data is materialized in a PostgreSQL database with the PostGIS and TimescaleDB extensions: this data can be considered as a time series with also geographic coordinates, reflecting the position of the sensors (Figure 3-2). The integrated data is made available for external services (e.g., MOMIS Dashboard, REST APIs) that can be used to analyze the most recent data (Figure 3-3). Then, after a predefined retention time, the integrated data is moved from PostgreSQL to Delta Lake (Figure 3-5) to avoid losing performance, creating a historical view of the data.</p>
          <p>3.2. Data Collector Dataflow. Data Collector is the web server managed by ENEA that hosts the PostgreSQL database containing the structured data of the SelfUser project. The database stores a cleaned and aggregated version of the raw data about the energy consumption, the energy production, and weather conditions; all measurements are aggregated at intervals of 15 minutes.</p>
          <p>The dataflow to import this data in ECDP is described in Figure 4. MOMIS connects to the Data Collector server by using the appropriate wrapper, acquires the data never read before, and stores it in a PostgreSQL database optimized to manage time series (Figure 4-1). Since this data was already preprocessed and aggregated, no further transformations are needed and it can be made available to external data analysis services, such as MOMIS Dashboard and REST APIs (Figure 4-2). Again, the integrated data is periodically moved from PostgreSQL to Delta Lake (Figure 4-3), where advanced users can operate on it using Spark (Figure 4-4).</p>
          <p>3.3. Delta Lake Dataflow. Public Energy Living Lab (PELL) is a platform developed by ENEA to collect data about the energy consumption of public lighting, which represents a key task for urban renewal. The platform uses MySQL to store static information about the lighting devices (e.g., the position and other technical details) and HDFS to store the consumption of each device (dynamic data).</p>
          <p>The dataflow is described in Figure 5. The static and dynamic data collected from the PELL server is combined by using a Spark script that produces a DataFrame which is stored into Delta Lake (Figure 5-1). From Delta Lake, the data can be directly queried by advanced users through the connectors provided by Spark (Figure 5-4) or it can be processed by MOMIS (Figure 5-2). MOMIS reads the data from Delta Lake, performs data integration operations, and stores the data in a PostgreSQL database equipped with extensions to manage time series; in fact, the data about the energy consumption can be seen as a time series, since every record is associated with a specific time. The users can access the integrated data by querying the database through a dashboard with predefined views, or by using REST APIs (Figure 5-3). Periodically, after a predefined retention time, the integrated data is moved from PostgreSQL to Delta Lake to avoid losing performance (Figure 5-5).</p>
        </sec>
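        <p>The dataflows above are coordinated by the Data Workflow Management System as directed acyclic graphs of black-box tasks (Section 2.2.3). The following is a minimal dependency-ordered executor in plain Python, illustrating the idea rather than the Apache Airflow API; the task names are invented placeholders for the steps of a dataflow.</p>

```python
# Minimal sketch of dependency-ordered task execution (not the Airflow API).
from graphlib import TopologicalSorter

def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of upstream task names}."""
    order, results = [], {}
    for name in TopologicalSorter(deps).static_order():
        try:
            results[name] = tasks[name]()
            order.append(name)
        except Exception as exc:  # simplified trigger rule: a failure halts the run
            results[name] = exc
            break
    return order, results

tasks = {
    "ingest":    lambda: "raw files detected",
    "integrate": lambda: "cleaned and unified rows",
    "export":    lambda: "data exposed to dashboards/APIs",
}
deps = {"ingest": set(), "integrate": {"ingest"}, "export": {"integrate"}}
order, _ = run_dag(tasks, deps)
print(order)  # ['ingest', 'integrate', 'export']
```

        <p>Because each task is a black box behind a callable, the same ordering logic works whether a step is a MOMIS wrapper run, a Spark script, or a retention-time move from PostgreSQL to Delta Lake.</p>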
        <sec id="sec-1-3-2">
          <title>4. Conclusions</title>
          <p>We presented the Energy Community Data Platform (ECDP), a big data platform designed for the smart monitoring of local energy communities, to encourage a more conscious use of energy by the users. For this purpose, ECDP can be useful both for the administrators of the energy communities, allowing a better monitoring of the community performance and a classification of the energy consumption profiles, and for the users, who can set energy (self-)consumption targets and receive feedback about their adherence to these plans, adapting their behavior accordingly in a conscious way. The modular architecture of ECDP, conceived to support different uses of data by different types of users, has the goal of maximizing flexibility and scalability. As illustrated through real-world use cases, these features represent the main strengths of the presented big data platform and make ECDP suitable for being applied to any type of local energy community.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgements</title>
      <sec id="sec-2-1">
        <title>Funding</title>
        <p>The project was funded by the Italian Ministry of Economic Development as a part of the 2019-2021 National Electricity System Research Plan.</p>
        <p>The main lesson that we learned from this project, which makes our experience relevant and reusable for related tasks, is that keeping the workflows separated through a modular architecture allows to combine the strengths of each adopted tool, guaranteeing the availability of data at different abstraction levels, making the system more flexible to better support future changes, and meeting the needs of the different types of users. The data lake (Delta Lake) can store in an efficient way a huge amount of raw data, allowing to perform further analysis operations on it. The data integration system (MOMIS) can clean and integrate the raw data and store it into the relational DBMS (PostgreSQL with the TimescaleDB extension). The relational DBMS can be used as a cache to store materialized views of the most recent integrated data (older data is moved to the data lake), speeding up the access by the data analytics tools (MOMIS Dashboard). A single workflow cannot guarantee this flexibility: directly integrating and storing the whole data in a relational DBMS would not allow to exploit raw data to perform new types of analysis in the future, while relying only on Delta Lake raises significant efficiency issues, since its Spark-based interaction requires to load the whole dataframe in memory every time. Abstracting from the specific case, this lesson can be useful for every project dealing with the management and the analysis of time series, proposing an approach to face this challenging task for big data platform design and implementation.
for the Lexical Annotation of Domain Ontologies,</p>
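The caching pattern described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the retention window, and the layer labels are hypothetical and do not come from the actual ECDP implementation.

```python
from datetime import datetime, timedelta

# Hypothetical retention window: the relational DBMS caches only the
# most recent integrated data; older data lives in the data lake.
CACHE_RETENTION = timedelta(days=90)

def route_query(start: datetime, end: datetime, now: datetime):
    """Decide which storage layers must serve a time-range query.

    Returns a list of (layer, range_start, range_end) tuples, where
    layer is "lake" (Delta Lake) or "dbms" (PostgreSQL/TimescaleDB).
    """
    cache_floor = now - CACHE_RETENTION
    plan = []
    if start < cache_floor:
        # The older slice must be read from the data lake, which
        # implies a heavier Spark-based scan of the raw/archived data.
        plan.append(("lake", start, min(end, cache_floor)))
    if end > cache_floor:
        # The recent slice is served by the materialized views cached
        # in the relational DBMS (fast access for the dashboards).
        plan.append(("dbms", max(start, cache_floor), end))
    return plan
```

With a 90-day retention window, a one-year query is split into a Delta Lake scan for the older slice and a fast relational read for the recent one, whereas a 10-day query never touches the data lake; this is the flexibility that a single-workflow design cannot offer.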
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>