<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ECDP: A Big Data Platform for the Smart Monitoring of Local Energy Communities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Gagliardelli</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Zecchini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Beneventano</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Simonini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonia Bergamaschi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mirko Orsini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Magnotta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emma Mescoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Livaldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Gessa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Piero De Sabbata</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca D'Agosta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Paolucci</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Moretti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DataRiver S.r.l.</institution>
          ,
          <addr-line>Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA)</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Modena and Reggio Emilia</institution>
          ,
          <addr-line>Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present the Energy Community Data Platform (ECDP), a middleware platform designed to support the collection and the analysis of big data about the energy consumption inside local energy communities, with the aim of encouraging a more conscious use of energy by the users. The big data platform, commissioned by ENEA, acquires data of different nature (e.g., describing the measurement of the energy consumption and production, weather conditions, etc.) in a heterogeneous format from multiple sources. We describe the architecture of ECDP, designed to support a Data Integration Workflow and a Data Lake Workflow, conceived for different uses of the data, motivating our technological choices. Then, we illustrate several dataflows reflecting real-world use cases, which highlight the advantages offered by the designed architecture for different types of users. The main strengths of the presented big data platform are flexibility and scalability (guaranteed by its modular architecture), which allow its applicability to any type of local energy community.</p>
      </abstract>
      <kwd-group>
        <kwd>Big Data Integration</kwd>
        <kwd>Energy Communities</kwd>
        <kwd>Big Data Platform</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The Energy Community Data Platform (ECDP), commissioned by ENEA (the Italian National Agency for New Technologies, Energy and Sustainable Economic Development), is a middleware platform we designed to collect and analyze big data about the energy consumption inside Local Energy Communities (LEC), with the aim of encouraging a conscious use of energy by the users, at home and in the workplace.</p>
      <p>ECDP is designed for the acquisition of data from different dataflows in a heterogeneous format, the proper management of the workflows to retrieve and store the great amount of data acquired from different sources and utilities (of a public or private nature), and functionalities of data integration, transformation, and cleaning, required to make the data ready to run queries on it and for its use in data analysis and visualization operations.</p>
      <p>The design of the big data platform was driven by several real-world use cases, which allowed us to detect two main requirements to be satisfied: (i) it must allow to preprocess the data in a batch and periodical way, then to store it, making it available for data analysis without having to repeat these operations every time; (ii) it must integrate heterogeneous data acquired from multiple sources with different characteristics. To comply with the first requirement, ECDP exploits an efficient data lake storage layer, namely Delta Lake [1]. Storing the data in Delta Lake allows to mitigate the excessive execution time needed to import large files or to import data in bulk from database management systems. For the second requirement, the architecture of the big data platform is centered on MOMIS (Mediator EnvirOnment for Multiple Information Sources), an open-source data integration system [2, 3, 4] which adopts a semantic approach to the integration of different data sources, also making the acquisition of new sources easier. ECDP was developed following the wrapper/mediator architecture of MOMIS, which allows to aggregate information from heterogeneous data sources (both structured and semi-structured) and make it homogeneous, in a semi-automatic way. Thus, it makes it possible to obtain a single unified source, without any redundancy or conflict in the data.</p>
      <p>In Section 2, we describe the architecture of the big data platform, motivating the structural choices and the adopted technologies. The main strengths of the proposed solution are flexibility and scalability. ECDP was designed in a modular manner as a flexible and open system, to ease future extensions for the implementation of new functionalities, different ways for connecting to and retrieving data from devices or services, and the integration with the other components of the software architecture or software systems present inside the LEC. These features allow an extremely wide range of applications for ECDP, making it suitable for any type of local energy community.</p>
      <p>Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29–April 1, 2022, Edinburgh, UK). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).</p>
      <p>ORCID: 0000-0001-5977-1078 (L. Gagliardelli); 0000-0002-4856-0838 (L. Zecchini); 0000-0001-6616-1753 (D. Beneventano); 0000-0002-3466-509X (G. Simonini); 0000-0001-8087-6587 (S. Bergamaschi); 0000-0002-5087-9530 (M. Orsini)</p>
      <p>The flexibility of ECDP is highlighted in Section 3, where we illustrate its adaptability to several real-world use cases that drove the design process, addressing different application areas. These use cases are related to projects carried out by ENEA, namely SelfUser, aiming at maximizing the collective self-consumption based on renewable energy in a condominium (Section 3.2), and PELL, to optimize the energy consumption of public lighting (Section 3.3). Moreover, also the European project GECO about green energy communities was considered as a reference for the design of the architecture.</p>
      <sec id="sec-1-1">
        <title>Contributions by the groups</title>
        <p>The presented big data platform was designed and developed within the project “Smart Monitoring of a Local Energy Community” (2020-2021), supervised by ENEA. The design of the platform was carried out by the Database Research Group (DBGroup, https://dbgroup.unimore.it) of the University of Modena and Reggio Emilia, leveraging its experience in big data integration and analysis [5, 6, 7, 8, 9], jointly with DataRiver (https://www.datariver.it/en), which subsequently took care of its implementation. Both phases complied with the specifications and the technical documentation provided by ENEA and saw a continuous interaction with its project team.</p>
        <sec id="sec-1-1-1">
          <title>Footnotes</title>
          <p>1 https://www.enea.it; 2 http://www.selfuser.it; 3 https://www.pell.enea.it; 4 https://www.gecocommunity.it</p>
        </sec>
        <p>This section illustrates the architecture of the Energy Community Data Platform (ECDP), represented in Figure 1, whose main characteristics and requirements were presented in Section 1.</p>
        <sec id="sec-1-1-2">
          <title>2.1. Data Sources</title>
          <p>The three modules identified as "Data Sources" in Figure 1 reflect the existing scenario and the solutions adopted by ENEA, and constitute the basis on which to build the big data platform.</p>
          <p>Data sources are multiple and heterogeneous, and will certainly increase in number and change over time. The data about the functioning of the LEC can be static or dynamic. Static data is collected once and stored in tables of a relational DBMS; it can be related to people (e.g., personal information), buildings (e.g., geopositioning or exterior insulation), or apartments/offices (e.g., air conditioning). Dynamic data, which is collected by sensors (continuously or at different intervals) or received from outside the LEC (e.g., from energy providers), is mainly related to the energy consumption, production, or accumulation. In the considered use cases, for example, ECDP has to deal with data about the energy consumption/production in condominiums and single apartments, measured with a granularity of one second and sent weekly in CSV format as ZIP archives (i.e., condominium data) or measured with a granularity of 15 minutes and sent continuously as CSV files (i.e., smart meter data), but also meteorological data collected by weather stations with a granularity of one minute and stored as JSON files, and many other types of data, such as information about the presence of the personnel in the offices (in TXT format).</p>
          <p>Furthermore, the considered projects also present some specific additional data. SelfUser adds to the described scenario structured data stored in a PostgreSQL DBMS, providing information about the energy consumption, the energy production by photovoltaic panels, and weather conditions. This structured data presents a granularity of 15 minutes and is obtained by preprocessing some of the raw data described above. The PELL platform deals with static data (e.g., the position of the lighting devices), stored using a MySQL DBMS, and dynamic data (e.g., information about the energy consumption, measured by smart meters), stored in HDFS [10] according to the UrbanDataset data structure (https://smartcityplatform.enea.it/UDWebLibrary).</p>
          <p>The ingestion of the described input data can be managed in different ways. The main solution adopted by ENEA is based on the open-source system integration framework Apache Camel (https://camel.apache.org), running in a deeply customized OSGi Apache Karaf (https://karaf.apache.org) container, named SignalMix. In this environment, both specific custom components and Camel allow to use domain-specific languages (namely, a Java DSL and a Spring XML DSL) to define routes, containing the flow and logic of integration between different systems, protocols, and formats, supporting most of the enterprise integration patterns. However, also different solutions are adopted depending on the source (e.g., some JSON/CSV files are sent directly by the users via e-mail).</p>
          <p>ENEA exploits two web servers to store the ingested data. The central element is the Data Hall server, which supports the so-called "Triage Area", which is the reference for the storage of unstructured data and contains the files (whose format can be CSV, JSON, etc.) as they are acquired from the sources by data ingestion systems, without applying further preprocessing operations on them. From the Data Hall server it is possible to access the second server, called Data Collector, via remote port forwarding. This server supports a PostgreSQL (https://www.postgresql.org) DBMS, which represents the reference for structured data instead (e.g., the data related to SelfUser). Moreover, it is possible to exploit additional database management systems for testing, which are available on other local network servers.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>2.2. Data Workflow</title>
          <p>The module on the right in Figure 1 represents ECDP, the big data platform designed to extract value from this great amount of ingested data. To ensure maximum flexibility, allowing different types of users to operate on data in different ways, it is possible to define two specific workflows (with the related functional modules): (i) a Data Integration Workflow, designed to perform data integration, which provides the main access point to the unified data for every application; (ii) a Data Lake Workflow, which allows to store the whole historical data in a raw (i.e., not integrated) form, available to be handled by expert users if needed. Both systems are coordinated through a Data Workflow Management System, used to define the settings for the operations needed to manage the data workflow (e.g., new input data activating triggers/ETL, failure alerts, error handling). These three modules are detailed in the following subsections.</p>
          <p>2.2.1. Data Integration Workflow: MOMIS. The chosen data integration tool is MOMIS, and it was a natural choice, since this open-source system, currently managed by DataRiver, was designed by the DBGroup and played a central role in its research activities for several years [11, 12].</p>
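          <p>The split between the two workflows can be sketched in a few lines of Python. This is an illustrative sketch, not actual ECDP code, and the record fields ("meter", "ts", "kwh") are invented for the example: the same ingested batch is appended untouched to the data lake, while a cleaned and typed version lands in the relational store.</p>

```python
# Illustrative sketch (not actual ECDP code): the same ingested batch feeds
# both workflows. Record fields ("meter", "ts", "kwh") are invented here.

RAW = [
    {"meter": "m1", "ts": "1640995200", "kwh": " 0.42 "},
    {"meter": "m1", "ts": "1640995260", "kwh": None},  # missing reading
]

def data_lake_workflow(records, lake):
    """Data Lake Workflow: keep the whole history in raw, not-integrated form."""
    lake.extend(records)

def data_integration_workflow(records, db):
    """Data Integration Workflow: clean and normalize before storing."""
    for r in records:
        if r["kwh"] is None:  # simplified cleaning step: drop missing values
            continue
        db.append({"meter": r["meter"], "ts": int(r["ts"]), "kwh": float(r["kwh"])})

lake, db = [], []
data_lake_workflow(RAW, lake)
data_integration_workflow(RAW, db)
# the lake keeps both raw records; the database keeps the one clean, typed record
```

          <p>Keeping both paths fed from the same ingestion step is what later lets advanced users go back to the untouched records when the integrated view is too coarse.</p>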
          <p>MOMIS is based on a wrapper/mediator architecture: the wrapper makes a data source available to be integrated, then the mediator performs data fusion [13] to generate in a semi-automatic way a mediated schema, called Global Virtual View (GVV) or Global Schema (GS), of the schemas of the local sources. The user can query this schema (through the query manager or through third-party applications) to obtain a complete and unified view of the data contained in the local sources. MOMIS is a virtual data integration system, i.e., the data is retrieved from the sources at query time. This allows to avoid data replication, so that each query returns updated data.</p>
          <p>However, it is possible to materialize some global classes
(materialized views); this can be useful to optimize the
retrieval of frequently used data or complex queries.</p>
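          <p>The wrapper/mediator idea can be illustrated with a minimal sketch. The class and field names below are ours, not the MOMIS API: each wrapper maps a local schema onto the global schema, and the mediator pulls from the sources at query time, so the virtual view is always fresh; running an unrestricted query plays the role of materializing a global class.</p>

```python
# Minimal wrapper/mediator sketch; names are ours, not the MOMIS API.

class Wrapper:
    def __init__(self, rows, mapping):
        self.rows = rows
        self.mapping = mapping  # global attribute -> local attribute

    def fetch(self):
        # rename local attributes to the global (mediated) schema
        return [{g: row[l] for g, l in self.mapping.items()} for row in self.rows]

class Mediator:
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, predicate):
        # data is retrieved from the sources now, so each query sees fresh data
        return [r for w in self.wrappers for r in w.fetch() if predicate(r)]

csv_src = Wrapper([{"id": "a1", "cons_kwh": 1.5}], {"meter": "id", "kwh": "cons_kwh"})
pg_src = Wrapper([{"meter_id": "b2", "energy": 0.7}], {"meter": "meter_id", "kwh": "energy"})
gvv = Mediator([csv_src, pg_src])           # plays the Global Virtual View
high = gvv.query(lambda r: r["kwh"] > 1.0)  # unified view across both sources
snapshot = gvv.query(lambda r: True)        # "materializing" a global class
```

          <p>The design choice the sketch mirrors is that schema heterogeneity is absorbed at the wrapper boundary, so adding a new source only requires a new mapping, not changes to the queries.</p>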
          <p>In particular, ECDP is based on MOMIS I4.0, the specific industrial IoT extension of MOMIS for Industry 4.0 [4], a web and mobile application designed to effectively collect and manage big data generated by machinery and sensor networks in industrial processes, exploiting artificial intelligence and machine learning techniques. This platform provides tools for the continuous monitoring and advanced services for the real-time analysis of production and quality performance, which allows to learn from experience for predictive maintenance policies, production process optimization, and energy consumption reduction. The technology stack of MOMIS I4.0 is illustrated in Figure 2 (MOMIS Industrial IoT Technology Stack).</p>
          <p>Data ingestion is performed by software modules called wrappers, which allow to connect to data sources, extracting their schemas and features. The interfaces created through the analysis of these schemas allow to represent the heterogeneous sources in a common language, making them homogeneous through data integration services. Several wrappers were created for ECDP to obtain the representation of data from different dataflows: (i) a version of the CSV wrapper for detecting the ZIP archives about condominium data on the Data Hall server and extracting their content; (ii) a version of the CSV wrapper for detecting the files about smart meter data on the Data Hall server; (iii) a version of the JSON wrapper for detecting the files about meteorological data on the Data Hall server; (iv) the PostgreSQL wrapper for connecting to the DBMS on the Data Collector server to ingest SelfUser data; (v) the Delta Lake wrapper to ingest PELL data from the data lake. Moreover, an MQTT, an OPC UA, and a WoT wrapper, not used in the final version, were implemented and initially considered to directly ingest data from smart meters.</p>
          <p>The software module for data integration exploits the MOMIS mediator to obtain the semantic representation of data sources, allowing to create an ontology of the domain, called Renewable Energy Community (REC) Ontology. The semantic integration is used to perform the mapping between data sources and the mediated schema. Novel services to support data integration were developed for ECDP, allowing to intervene on dataflows with transformation operations such as applying mathematical operators, computing average/minimum/maximum values, performing time conversion to different formats (e.g., UNIX format) and time-based missing value imputation, and adding or removing tuples according to the data timestamp.</p>
          <p>Data querying and export services are used to allow authorized users and applications to access a certain portion of data or an aggregate view. The access is managed through a role-based access control to guarantee to each type of user the right to access the needed information and, at the same time, the security of the information for which the authorization was not granted. ECDP relies on two distinct APIs: (i) ExportUD, to query the integrated table of readings and export the values in UrbanDataset-XML, UrbanDataset-JSON, or raw CSV format; (ii) SQL_API, to run the user queries based on the queries defined in the application and the related access authorizations.</p>
          <p>ECDP also supports a customized script scheduling service, which can be used to run scripts on the MOMIS server. This service simply runs the main class of the script (in Bash or MATLAB language) and is extremely flexible, since it can use the GUI to collect different scripts and to define the configuration for their scheduling.</p>
          <p>Finally, the source management and storage services allow to monitor the data collected and processed by the other software modules. In particular, the source management dashboard was realized using the MOMIS data analytics module (i.e., MOMIS Dashboard). It supports a unified view of the business data integrated with external data sources, searching and monitoring aggregate data from distributed and heterogeneous data sources, visualizing indicators on charts and dynamic tables, and managing security and visibility of data based on roles and user groups.</p>
          <p>ECDP exploits a hybrid storage, based on two solutions for different uses: (i) the Delta Lake storage layer, used to store big data for offline data historicization and analysis (if the data timestamp has passed the retention time, the data is historicized in Delta Lake and removed from PostgreSQL); (ii) a PostgreSQL relational DBMS, optimized through the PostGIS and TimescaleDB extensions for querying temporal series, used to answer the most frequent queries on recent data, and containing a synthesis of the data historicized in Delta Lake.</p>
        </sec>
        <sec id="sec-1-1-4">
          <title>2.2.2. Data Lake Workflow: Delta Lake</title>
          <p>The estimated size of data and the available resources allow to store raw data (i.e., data as it is acquired, without applying any preprocessing operation on it) in a data lake which lays on HDFS. The main goal of this workflow is to support data analysis performed by advanced users who need to access raw data to retrieve additional information that cannot be obtained from integrated data. For example, raw sensor data collected with a granularity of one second is stored in the data lake, while in the Data Integration Workflow it is aggregated according to a granularity of 15 minutes. Thus, if an advanced user needs data at the finest granularity, it is necessary to exploit the Data Lake Workflow.</p>
          <p>The chosen technology for the data lake is Delta Lake [1], an open-source project which allows to manage a great amount of data using existing storage tools such as HDFS. Delta Lake is integrated with the whole Apache Spark (https://spark.apache.org) ecosystem, allowing the native use of all its libraries (e.g., MLlib for machine learning) and the execution of efficient elaborations on data both in batch and in streaming mode. The possibility of exploiting this powerful engine to operate on raw data was one of the core reasons for the choice of Delta Lake, also considering the significant role played by Spark in the DBGroup research activity [14, 15, 16]. Delta Lake is a mature tool, used by multiple cloud service providers such as Databricks and Microsoft Azure, which supports ACID transactions, guaranteeing the highest level of isolation (this allows to execute read and write operations at the same time without issues about data integrity). Delta Lake exploits Apache Parquet (https://parquet.apache.org), an open-source format which performs data compression, requiring much less space for data storage than the original format (usually CSV). Moreover, it supports data versioning, tracking the operations executed on data through a system of logs and allowing to execute rollbacks; it allows to modify the data schema, and it handles metadata as if it were data.</p>
        </sec>
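        <p>The granularity gap between the two workflows can be sketched as follows. This is a toy example of the aggregation step (the mean value per 15-minute bucket), not the platform's actual transformation service: one-second raw readings stay in the data lake, while the integrated view only keeps the coarse buckets.</p>

```python
# Toy sketch of the 1-second -> 15-minute aggregation performed in the
# Data Integration Workflow (the raw readings remain in the data lake).

def aggregate(readings, bucket_s=15 * 60):
    """readings: iterable of (unix_ts, value) pairs -> {bucket_start: mean}."""
    buckets = {}
    for ts, value in readings:
        start = ts - ts % bucket_s  # align the timestamp to its bucket start
        buckets.setdefault(start, []).append(value)
    return {start: sum(vs) / len(vs) for start, vs in buckets.items()}

raw = [(0, 1.0), (1, 3.0), (900, 5.0)]  # three one-second readings
print(aggregate(raw))                   # {0: 2.0, 900: 5.0}
```

        <p>An advanced user who needs the two readings inside bucket 0 separately has to go back to the lake; the integrated store only ever sees their mean.</p>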
      </sec>
      <sec id="sec-1-3">
        <title>2.2.3. Data Workflow Management System: Apache Airflow</title>
        <p>The goal of the Data Workflow Management System is the orchestration of the different functional modules. In particular, when new data is made available from the sources, it has to provide the mechanisms to automatically identify it and to activate the appropriate workflows. Furthermore, it has to support the handling of unexpected errors and it must allow to compose workflows by combining different software modules. The chosen tool has to be compatible with these modules, programmable, and simple to use. Considering these requirements, we chose Apache Airflow (https://airflow.apache.org), which allows to: (i) manage task workflows through simple Python scripts defining the dependencies among tasks, whose execution order can be represented with a directed acyclic graph (each module is a black box, so it can be executed by any tool and implemented with any programming language); (ii) monitor the workflow (even through a GUI) and manage the possible errors generated by tasks, using scripts to define the trigger rules.</p>
        <sec id="sec-1-3-1">
          <title>3. Use Cases and Scenarios</title>
          <p>In this section we report three dataflows related to real-world use cases that show how ECDP can be used in different application scenarios. In particular, we want to illustrate: (i) how unstructured data is ingested (Section 3.1); (ii) how structured data is managed (Section 3.2); (iii) how it is possible to use Spark to directly import data in the platform (Section 3.3).</p>
          <p>The chosen dataflows let us highlight the advantages and the flexibility offered by the coexistence of the two distinct workflows, using a data lake to store the entire raw data (i.e., Data Lake Workflow) in combination with a structured database to store the integrated data (i.e., Data Integration Workflow). In fact, this solution allows the standard users to access an integrated and clean view of the data (e.g., they can see in a dashboard the hourly energy production by the photovoltaic panels related to the weather conditions), while advanced users can use the raw data to perform advanced tasks (e.g., a data scientist can use the data about the energy production by the photovoltaic panels collected at the finest granularity, combined with the solar radiation collected by the sensors, to train a machine learning model to predict the energy production).</p>
          <p>3.1. Data Hall Dataflow. Data Hall is the web server managed by ENEA on which the so-called "Triage Area" (see Figure 1) is located; it hosts unstructured raw data (e.g., JSON and CSV files). The web server mainly contains data of the SelfUser project, which aims to test the Clean Energy Package directive by creating an innovative plant, in a pilot form, to support the energy transition through the promotion of energy produced by renewable sources (in this case, photovoltaics). The data, such as the energy consumption, the energy production by photovoltaic panels, and weather conditions, is collected from several sensors placed in smart buildings. The data can be considered as time series with different time granularities: the energy consumption/production is measured every second, while the weather information every minute.</p>
          <p>The dataflow is described in Figure 3. Firstly, MOMIS (Figure 3-1) acquires the new data from the Data Hall server by using its wrappers and stores it into Delta Lake, where it can be used for further analysis by directly querying the data lake with the connectors provided by Spark (Figure 3-4). Then, the raw data is processed by MOMIS to clean and integrate it. The resulting integrated data is materialized in a PostgreSQL database with the PostGIS and TimescaleDB extensions: this data can be considered as a time series with also geographic coordinates, reflecting the position of the sensors (Figure 3-2). The integrated data is made available for external services (e.g., MOMIS Dashboard, REST APIs) that can be used to analyze the most recent data (Figure 3-3). Then, after a predefined retention time, the integrated data is moved from PostgreSQL to Delta Lake (Figure 3-5) to avoid losing performance, creating a historical view of the data.</p>
          <p>3.2. Data Collector Dataflow. Data Collector is the web server managed by ENEA that hosts the PostgreSQL database containing the structured data of the SelfUser project. The database stores a cleaned and aggregated version of the raw data about the energy consumption, the energy production, and weather conditions; all measurements are aggregated at intervals of 15 minutes.</p>
          <p>The dataflow to import this data in ECDP is described in Figure 4. MOMIS connects to the Data Collector server by using the appropriate wrapper, acquires the data never read before, and stores it in a PostgreSQL database optimized to manage time series (Figure 4-1). Since this data was already preprocessed and aggregated, no further transformations are needed and it can be made available to external data analysis services, such as MOMIS Dashboard and REST APIs (Figure 4-2). Again, the integrated data is periodically moved from PostgreSQL to Delta Lake (Figure 4-3), where advanced users can operate on it using Spark (Figure 4-4).</p>
          <p>3.3. Delta Lake Dataflow. Public Energy Living Lab (PELL) is a platform developed by ENEA to collect data about the energy consumption of public lighting, which represents a key task for urban renewal. The platform uses MySQL to store static information about the lighting devices (e.g., the position and other technical details) and HDFS to store the consumption of each device (dynamic data).</p>
          <p>The dataflow is described in Figure 5. The static and dynamic data collected from the PELL server is combined by using a Spark script that produces a DataFrame which is stored into Delta Lake (Figure 5-1). From Delta Lake, the data can be directly queried by advanced users through the connectors provided by Spark (Figure 5-4) or it can be processed by MOMIS (Figure 5-2). MOMIS reads the data from Delta Lake, performs data integration operations, and stores the data in a PostgreSQL database equipped with extensions to manage time series; in fact, the data about the energy consumption can be seen as a time series, since every record is associated with a specific time. The users can access the integrated data by querying the database through a dashboard with predefined views, or by using REST APIs (Figure 5-3). Periodically, after a predefined retention time, the integrated data is moved from PostgreSQL to Delta Lake to avoid losing performance (Figure 5-5).</p>
        </sec>
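        <p>The dataflows above are coordinated by the Data Workflow Management System as directed acyclic graphs of black-box tasks (Section 2.2.3). The following is a minimal dependency-ordered executor in plain Python, illustrating the idea rather than the Apache Airflow API; the task names are invented placeholders for the steps of a dataflow.</p>

```python
# Minimal sketch of dependency-ordered task execution (not the Airflow API).
from graphlib import TopologicalSorter

def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of upstream task names}."""
    order, results = [], {}
    for name in TopologicalSorter(deps).static_order():
        try:
            results[name] = tasks[name]()
            order.append(name)
        except Exception as exc:  # simplified trigger rule: a failure halts the run
            results[name] = exc
            break
    return order, results

tasks = {
    "ingest":    lambda: "raw files detected",
    "integrate": lambda: "cleaned and unified rows",
    "export":    lambda: "data exposed to dashboards/APIs",
}
deps = {"ingest": set(), "integrate": {"ingest"}, "export": {"integrate"}}
order, _ = run_dag(tasks, deps)
print(order)  # ['ingest', 'integrate', 'export']
```

        <p>Because each task is a black box behind a callable, the same ordering logic works whether a step is a MOMIS wrapper run, a Spark script, or a retention-time move from PostgreSQL to Delta Lake.</p>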
        <sec id="sec-1-3-2">
          <title>4. Conclusions</title>
          <p>We presented the Energy Community Data Platform (ECDP), a big data platform designed for the smart monitoring of local energy communities, to encourage a more conscious use of energy by the users. For this purpose, ECDP can be useful both for the administrators of the energy communities, allowing a better monitoring of the community performance and a classification of the energy consumption profiles, and for the users, who can set energy (self-)consumption targets and receive feedback about their adherence to these plans, adapting their behavior accordingly in a conscious way. The modular architecture of ECDP, conceived to support different uses of data by different types of users, has the goal of maximizing flexibility and scalability. As illustrated through real-world use cases, these features represent the main strengths of the presented big data platform and make ECDP suitable for being applied to any type of local energy community.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgements</title>
      <sec id="sec-2-1">
        <title>Funding</title>
        <p>The project was funded by the Italian Ministry of Economic Development as a part of the 2019-2021 National Electricity System Research Plan.</p>
        <p>The main lesson that we learned from this project, which makes our experience relevant and reusable for related tasks, is that keeping the workflows separated through a modular architecture allows to combine the strengths of each adopted tool, guaranteeing the availability of data at different abstraction levels, making the system more flexible to better support future changes, and meeting the needs of the different types of users. The data lake (Delta Lake) can store in an efficient way a huge amount of raw data, allowing to perform further analysis operations on it. The data integration system (MOMIS) can clean and integrate the raw data and store it into the relational DBMS (PostgreSQL with the TimescaleDB extension). The relational DBMS can be used as a cache to store materialized views of the most recent integrated data (older data is moved to the data lake), speeding up the access by the data analytics tools (MOMIS Dashboard). A single workflow cannot guarantee this flexibility: directly integrating and storing the whole data in a relational DBMS would not allow to exploit raw data to perform new types of analysis in the future, while relying only on Delta Lake raises significant efficiency issues, since its Spark-based interaction requires to load the whole dataframe in memory every time. Abstracting from the specific case, this lesson can be useful for every project dealing with the management and the analysis of time series, proposing an approach to face this challenging task for big data platform design and implementation.
for the Lexical Annotation of Domain Ontologies,</p>
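The caching pattern described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the retention window, and the layer labels are hypothetical and do not come from the actual ECDP implementation.

```python
from datetime import datetime, timedelta

# Hypothetical retention window: the relational DBMS caches only the
# most recent integrated data; older data lives in the data lake.
CACHE_RETENTION = timedelta(days=90)

def route_query(start: datetime, end: datetime, now: datetime):
    """Decide which storage layers must serve a time-range query.

    Returns a list of (layer, range_start, range_end) tuples, where
    layer is "lake" (Delta Lake) or "dbms" (PostgreSQL/TimescaleDB).
    """
    cache_floor = now - CACHE_RETENTION
    plan = []
    if start < cache_floor:
        # The older slice must be read from the data lake, which
        # implies a heavier Spark-based scan of the raw/archived data.
        plan.append(("lake", start, min(end, cache_floor)))
    if end > cache_floor:
        # The recent slice is served by the materialized views cached
        # in the relational DBMS (fast access for the dashboards).
        plan.append(("dbms", max(start, cache_floor), end))
    return plan
```

With a 90-day retention window, a one-year query is split into a Delta Lake scan for the older slice and a fast relational read for the recent one, whereas a 10-day query never touches the data lake; this is the flexibility that a single-workflow design cannot offer.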
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>