Exploring energy performance certificates through visualization

         Tania Cerquitelli∗ , Evelina Di Corso∗ , Stefano Proto∗ , Alfonso Capozzoli† , Fabio Bellotti∗ ,

        Maria G. Cassese∗ , Elena Baralis∗ , Marco Mellia‡ , Silvia Casagrande§ , Martina Tamburini§
                      ∗ Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy
                                         † Department of Energy, Politecnico di Torino, Torino, Italy
                     ‡ Department of Electronics and Telecommunications, Politecnico di Torino, Torino, Italy
                                                               § Edison Spa, Torino, Italy


                                                             ∗ † ‡ name.surname@polito.it
                                                              § name.surname@edison.it


ABSTRACT                                                                           To enhance the effectiveness of data and knowledge explo-
Energy Performance Certificates (EPCs) provide interesting infor-               ration, a variety of data visualization techniques have been pro-
mation on the standard-based calculation of energy performance,                 posed. In [22, 23, 26] the authors exploited choropleth maps to
thermo-physical and geometrical related properties of a build-                  analyze the energy consumption and the electricity consump-
ing. Because of the volume of available data (issued as open                    tion per unit area, respectively. Instead, in [21], the authors used
data) and the heterogeneity of the attributes, the exploration of               dynamic simulations of building energy consumption and build-
this energy-related data collection is challenging. This paper                  ing information to develop urban energy maps with high spatial
presents INDICE (INformative DynamiC dashboard Engine), a                       resolutions. However, all the above works proposed static maps
new data visualization framework able to automatically explore                  to analyze the average values of some features of interest. The
large collections of EPCs. INDICE explores EPCs through both                    exploitation of dynamic and navigable maps tailored to the anal-
querying and analytics tasks, and intuitively presents the output               ysis of energy-related data has not been proposed so far. The
through informative dashboards. The latter include dynamic and                  authors in [24] propose an interactive 3D visualization to analyze
interactive maps along with different informative charts allow-                 the Linking Open Data (LOD) cloud adopting the metaphor of
ing different stakeholders (e.g., domain and non-domain expert                  urban area. The visualization is interactive, meaning that the
users) to explore and interpret the extracted knowledge at dif-                 user can enlarge any part of the model, modify the perspective,
ferent spatial granularity levels. The objective of INDICE is to                change the shape of the buildings and their positioning, view all
create energy maps useful for the characterization of the energy                the connections or only those belonging to a specific data set. A
performance of buildings located in different areas. The exper-                 parallel research effort has been devoted to explore and summa-
imental evaluation, performed on a real set of EPCs related to                  rize geolocated time series data through maps [8]. Moreover, a
a major Italian region in the North West of Italy, demonstrates                 great research effort has been done in [17], in which the authors
the effectiveness of INDICE in exploring an EPC dataset through                 propose a city energy model based on the requests and need for
different data and knowledge visualization techniques.                          visualization from a group of energy consultants. Their proposed
                                                                                model offers stakeholders a powerful tool for evaluating both the
                                                                                current state and future scenarios.
1    INTRODUCTION                                                                  This paper presents INDICE (INformative DynamiC dashboard
Nowadays large volumes of energy-related data are continuously                  Engine), a data visualization framework generating interactive
collected in different domains. To reduce wasteful energy con-                  and navigable dashboards through the analysis of a set of Energy
sumption, several orthogonal applications (e.g., buildings, IoT-                Performance Certificates (EPCs). An EPC is a legal requirement
based devices, wireless networks) increased their policy priority               when constructing, selling or renting a building, and it provides
on energy efficiency. According to the U.S. Department of En-                   interesting information on the calculated standard energy perfor-
ergy, in industrialized countries more than 40% of total energy is              mance, thermo-physical and geometrical properties of existing
consumed in buildings [14]. In the last few years many efforts                  buildings. The multi-tiered framework INDICE has been pro-
have been devoted to improve building energy efficiency with dif-               posed to effectively deal with large collections of EPCs. With
ferent final goals: (i) facilitating proactive energy-saving services           respect to the other works, our framework brings together many
[32], (ii) characterizing data streams of energy consumption of                 different analysis techniques to help non-expert users make sense
individual residential consumers in buildings [5–7], (iii) charac-              of Energy Performance Certificates. Indeed, after a pre-processing
terizing heating energy demand through the analysis of energy                   step, cluster analysis allows discovering groups of EPCs with sim-
performance certificates of buildings [4, 9, 11], and (iv) reducing             ilar features. To summarize the energy performance of buildings
emissions and energy consumption for buildings [20].                            at different granularities, INDICE generates informative dash-
                                                                                boards tailored to different energy stakeholders, combining both
© 2019 Copyright held by the author(s). Published in the Workshop Proceedings   a rich set of interesting knowledge and ease of use.
of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on
CEUR-WS.org.                                                                       The proposed informative dashboards exploit different kinds
                                                                                of energy maps to show data and knowledge at different spatial
granularity levels. The proposed visualization techniques allow        attributes, INDICE includes a multi-step algorithm to correctly
different energy stakeholders to easily capture the high-level         reconstruct and correct the wrong information. Specifically, it
overview of heating energy demand at a city level, and drill           compares the available addresses with a referenced street map
down to the single apartment. Moreover, in order                       that is usually available for each city. The referenced street map
to analyze the energy efficiency of different buildings through        should contain all the detailed information on streets, including
the most interesting attributes under analysis, INDICE includes        street names, house numbers, ZIP Code and geolocation (i.e.,
cluster-markers, which address the problem of representing             latitude and longitude). Given a city under analysis, INDICE au-
multiple variables at the same time.                                   tomatically downloads the referenced street map if it is available
   As a case study, a real collection of EPCs related to a major       online.
Italian region, in North West Italy, was analyzed. Preliminary            The referenced street map is exploited by INDICE to verify the
experimental results show that the proposed approach is effective      reliability of the addresses in the dataset under analysis to correct
in visualizing a manageable set of human-readable knowledge            errors in the address field and at the same time reconstruct miss-
for each end-user through dynamic and interactive maps.                ing or incorrect information in the attributes ZIP Code, house
   The next sections of the paper are organized as follows. Sec-       address, latitude and longitude. Specifically, the developed algo-
tion 2 introduces an overview of the INDICE system with a thor-        rithm compares each string in the dataset with the ones in the
ough description of its main building blocks. Section 3 discusses      referenced street map. For each pair of addresses, the Levenshtein
the preliminary experimental results obtained on a real data col-      distance [19] is computed to evaluate the similarity between two
lection and Section 4 draws conclusions and presents the future        character strings, in terms of the minimum number of modi-
development of this work.                                              fications (insertions, deletions and substitutions) necessary to
                                                                       transform the first string into the second one. The similarity
2     THE INDICE ANALYTICS SYSTEM                                      computed from Levenshtein distance takes values in the range
                                                                       [0, 1], where 0 indicates total dissimilarity and 1 equality of the
                                                                       compared strings. Given a user-defined threshold ϕ, the refer-
                                                                       enced address (the most similar to the address under analysis)
                                                                       replaces the original one if Levenshtein similarity between the
                                                                       two addresses is greater than or equal to ϕ. When the association
                                                                       to a referenced address is not possible, i.e., Levenshtein simi-
                                                                       larities are below ϕ, a geocoding request is sent via the Google
                                                                       Geocoding APIs¹. The latter is a reliable service that takes a textual
                                                                       address and reconstructs the whole address in a consistent way.
                                                                       However, INDICE exploits the Google Geocoding service only
                                                                       when the association cannot be resolved through the referenced
              Figure 1: The INDICE framework                           street map due to a limit on the number of free requests.
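The threshold-based address matching described above can be sketched as follows. This is a minimal illustration, not the INDICE implementation: the normalisation of the Levenshtein distance into a similarity (one minus the distance divided by the length of the longer string), the case-insensitive comparison, and the names levenshtein, similarity and match_address are assumptions introduced here.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation into [0, 1]: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def match_address(raw: str, reference: Sequence[str], phi: float) -> Optional[str]:
    """Return the most similar referenced address if its similarity is >= phi."""
    if not reference:
        return None
    best = max(reference, key=lambda ref: similarity(raw.lower(), ref.lower()))
    if similarity(raw.lower(), best.lower()) >= phi:
        return best
    return None  # below the threshold: the caller would fall back to geocoding
```

For instance, with ϕ = 0.9 the misspelled string "Corso Duca degli Abruzi" would be replaced by the referenced "Corso Duca degli Abruzzi" (one edit over 24 characters), whereas a string with no sufficiently close referenced address returns None and would be passed to the geocoding service.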
                                                                          2.1.2 Outlier detection and removal.
   INDICE (INformative DynamiC dashboard Engine) has been              An outlier is an extreme value that deviates from other obser-
tailored to analyze any collection of EPCs. The analysis of this       vations in the data. It may occur either when the collected value
kind of data is challenging, due to the large number of attributes     does not fit the model under study or when some error happens
characterizing each energy performance certificate. The exploita-      during the data collection phase. To address this issue, INDICE ex-
tion of this high dimensional data is burdensome due to the high       ploits three approaches: (i) univariate outlier detection, (ii) mixed
variability and dimensionality of data. INDICE combines different      univariate analysis, and (iii) multivariate outlier detection. In-
techniques to effectively visualize a rich set of knowledge items      dependently of the above adopted strategies, values labelled as
for a variety of energy stakeholders. The overall architecture is      outliers are not considered in the subsequent steps of analysis.
shown in Figure 1. INDICE includes three main building blocks,            Univariate outlier. INDICE integrates three methodologies
each one addressing one of the main steps of the knowledge-            to automatically detect outliers and remove them for the subse-
extraction process: (i) Data pre-processing, (ii) Data selection and   quent analytics steps: (i) the graphic boxplot method, (ii) the para-
analytics, and (iii) Data and Knowledge visualization. In the fol-     metric generalized Extreme Studentized Deviate (gESD) method
lowing, a detailed description of each building block is given.        [27] and (iii) the non-parametric Median Absolute Deviation
                                                                       (MAD) [15]. The boxplot [31] (also known as box-and-whisker plot) is a conve-
2.1    Data Pre-processing                                             nient way of visually displaying a data distribution through its
The INDICE pre-processing phase aims at smoothing the effect of        quartiles. The frequency distribution of each variable is summed
possibly unreliable data. It performs two tasks which have been        up through a few numbers (i.e. median, quartiles, min and max
proved to be crucial in real-world geospatial data: (i) geospatial     values). The median summarizes the central tendency of the dis-
coordinates cleaning and (ii) outlier detection and removal.           tribution, while the quartiles give an indication of the variability
                                                                       through the interquartile difference. The minimum and maxi-
   2.1.1 Geospatial data cleaning.
                                                                       mum values provide not only information about extremes but
This pre-processing step is crucial when the final aim is to dis-
                                                                       also on the possible presence of data with abnormal characteris-
play data and knowledge through maps. INDICE includes an
                                                                       tics w.r.t. the other points, plotting them individually. For each
ad-hoc strategy to clean geospatial attributes, including address,
                                                                       variable, the analyst can manually remove the outliers (i.e., the
house number, ZIP Code, latitude and longitude. Since the ad-
                                                                       values smaller and greater than the minimum and the maximum)
dress attribute is usually collected as a free text field, it often
                                                                       through value filters.
contains numerous typos and input errors, which require care-
ful analysis to be correctly fixed. To clean the above-mentioned       1 https://developers.google.com/maps/documentation/geocoding/intro
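The boxplot-based filter can be sketched as follows. The text does not state how the whisker limits are placed, so this sketch assumes the common convention of Q1 − 1.5·IQR and Q3 + 1.5·IQR, with quartiles computed by the "inclusive" method of Python's statistics module; the function names are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def boxplot_fences(values: Sequence[float], k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper whisker limits, Q1 - k*IQR and Q3 + k*IQR (assumed convention)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile difference
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values: Sequence[float], k: float = 1.5) -> Tuple[List[float], List[float]]:
    """Separate the values kept for the analysis from those flagged as outliers."""
    low, high = boxplot_fences(values, k)
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers
```

Values falling outside the two limits are the ones a boxplot draws individually, and they correspond to the points an analyst would remove through value filters.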
The gESD method [27] is used to detect one or more outliers             data mining algorithms (e.g., cluster analysis, association rules)
in a univariate data set. This test needs a parameter which is          data have to be properly transformed and peculiar attributes
the upper bound on the number of potential outliers. INDICE             have to be selected. Several techniques have been used to reduce
tests the null hypothesis that the data has no outliers versus the      the complexity of the datasets under analysis and discover effec-
alternative hypothesis that there are at most k outliers (for some      tive and hidden knowledge, interesting and readable by all the
user specified value of k). Given the upper bound, k, the gESD          different stakeholders involved in the analysis. This component
test essentially performs k separate tests: a test for one outlier, a   includes two innovative engines: (i) the query engine and (ii) the
test for two outliers, and so on up to k outliers. In INDICE the        data analytics engine.
number of outliers is determined by finding the largest value
                                                                            2.2.1 Querying engine.
r (with r ≤ k), such that the corresponding test gives a value
                                                                        To select and explore the dataset under analysis, INDICE im-
higher than the critical one.
                                                                        plements a query engine that lets the user focus on the single
Lastly, the MAD [15] is a robust statistical measure of
                                                                        attributes of the energy performance certificates. Possible stake-
the variability of a univariate sample of quantitative data. Calcu-
                                                                        holders may be citizens, public administration and energy scien-
lating the MAD is straightforward, as it only involves finding the
                                                                        tists. Each of them could be interested in different characteris-
median of absolute deviations from the median. It is calculated
                                                                        tics of the dataset under analysis. For each stakeholder, INDICE
by taking the absolute difference between each point and the
                                                                        produces the best possible representation to highlight the main
corresponding median, and then calculating the median of those
                                                                        interesting facets of the results. Citizens could be interested in the
differences. As proposed in [16], INDICE uses the score of 3.5 as
                                                                        energy analysis of the buildings related to a specific area of the
cut-off value. This means that every point with a score above 3.5
                                                                        city, or in the geometric features that characterize the buildings
is considered an outlier.
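A minimal sketch of the MAD-based test follows. The text does not define the score compared against 3.5; this sketch assumes the modified z-score 0.6745·|x − median| / MAD, which is commonly paired with that cut-off. The function name mad_outliers is hypothetical.

```python
import statistics
from typing import List, Sequence


def mad_outliers(values: Sequence[float], cutoff: float = 3.5) -> List[float]:
    """Return the values whose (assumed) modified z-score exceeds the cut-off."""
    med = statistics.median(values)
    # Median of the absolute deviations from the median.
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median: the score is undefined
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]
```

Because both the centre and the spread are medians, a single extreme value barely shifts them, which is what makes the test robust.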
                                                                        belonging to the same intended use. The citizens may want to
   The users can exploit all the different univariate methodologies
                                                                        discover areas of the city with better-performing buildings, to buy
and/or choose the most suitable one. If a non-expert user does
                                                                        a flat that performs well in terms of energy efficiency. The public
not know how to deal with these outlier detection techniques,
                                                                        administration may instead be interested in identifying
she can use default configurations, as described below.
                                                                        areas in which to promote and invest in energy renovations. Energy
   Expert-driven univariate analysis. Because some non-
                                                                        scientists could use INDICE to explore and characterize through
expert users may be interested in analysing EPC collections,
                                                                        supervised and unsupervised techniques groups of buildings with
INDICE suggests the univariate outlier detection method mostly
                                                                        similar properties to perform benchmarking analysis. Based on
used by domain experts in the past interactions with INDICE.
                                                                        the target of each stakeholder, the system is able to automatically
Specifically, by collecting and storing expert user (e.g., energy
                                                                        propose to the specific end-user an optimal set of interesting
scientists) INDICE configurations, the non-expert users can re-
                                                                        reports and graphical representations, with the possibility to set
ceive interesting and effective suggestions to properly deal with
                                                                        manually the subset of features and parameters for the queries
noisy data. In the current version of INDICE, only relevant at-
                                                                        in which she is interested.
tributes describing the building thermo-physical characteristics
(e.g., Aspect Ratio, Average U-value of the vertical opaque enve-          2.2.2 Data analytics engine.
lope and Average U-value of the windows) and the efficiency of          To extract meaningful and interesting knowledge items from data,
the heating subsystems (e.g., Distribution Subsystem Efficiency         INDICE includes different supervised and exploratory algorithms
and Generation Subsystem Efficiency) have been considered. In           to automatically analyze feature subsets. INDICE integrates the
this way, if a non-expert user does not know which univariate           K-means clustering algorithm [18] to create groups of buildings
analysis technique should be used, she can use a configuration          with similar thermo-physical and energy properties, and associ-
adopted by previous INDICE expert users, since their choices            ation rule mining [1] to extract interesting correlations among
are automatically stored as default configurations for non-expert       features.
users.                                                                     K-means algorithm. The partitional K-means cluster algo-
   Multivariate outlier detection. For the multivariate outlier         rithm [18] is exploited by INDICE to identify groups of EPCs
detection, INDICE integrates the DBSCAN algorithm (Density-             characterized by similar properties. To measure the similarity be-
Based Spatial Clustering of Applications with Noise) [12] to auto-      tween EPCs, the Euclidean distance is computed. The K-means al-
matically identify outliers. Specifically, DBSCAN detects clusters      gorithm, one of the most popular clustering algorithms, divides
based on a density reachability concept, where clusters with            the input dataset into K groups, where K is defined a-priori. The
higher-density regions are separated by lower-density regions.          average of all the energy certificates in each cluster represents
DBSCAN requires two user-defined parameters (i.e., minPoints            the centroid (representative point) of each group of buildings.
and Epsilon). To properly specify these input parameters INDICE         First, the algorithm chooses randomly K initial centroids. Then,
plots the k-distance graph and automatically estimates a good           each point is assigned to the closest centroid and the centroids are
value for each parameter. As proposed in [10], INDICE builds the        recalculated. The previous steps are repeated until the centroids
k-distance plot several times for different values of minPoints,        no longer change. K-means is able to identify a good cluster set
and selects minPoints when the curve stabilises, and Epsilon as         in a limited computational time. INDICE analyses the trend of
the elbow point of the stable curve.                                    the SSE (sum of squared errors) quality index to evaluate the
                                                                        cluster cohesion [30] and automatically identify possible good K
                                                                        values. The SSE is computed as the total sum of squared errors
2.2    Data selection and analytics
                                                                        for all objects in the collection, where for each object the error is
The knowledge visualization step is preceded by a data selection        computed as the squared distance from the closest centroid. As
and analytics phase. Since each energy performance certificate          done in [30], in INDICE the K value is chosen as the point where
includes a large number of features characterized by a great vari-      the marginal decrease in the SSE curve is maximized (aka elbow
ability, in order to extract accessible knowledge and implement         approach).
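The K-means iteration, the SSE index and the elbow choice described above can be sketched as follows. This is an illustration under stated assumptions rather than the INDICE code: the seeded random initialisation, the handling of empty clusters, and the reading of the elbow as the K with the largest drop in marginal SSE decrease (largest second difference) are choices made here, and the names kmeans and elbow are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points: Sequence[Point], k: int, seed: int = 0,
           max_iter: int = 100) -> Tuple[List[Point], float]:
    """Run K-means and return (centroids, SSE)."""
    rng = random.Random(seed)
    centroids: List[Point] = rng.sample(list(points), k)  # random initial centroids
    for _ in range(max_iter):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:  # assign each point to the closest centroid
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[c]  # assumption: keep an empty cluster's centroid
            for c, members in enumerate(clusters)
        ]
        if updated == centroids:  # centroids no longer change
            break
        centroids = updated
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


def elbow(sse_by_k: Dict[int, float]) -> int:
    """Pick the K where the marginal SSE decrease drops the most (assumed reading)."""
    ks = sorted(sse_by_k)
    if len(ks) < 3:
        raise ValueError("at least three K values are needed to locate an elbow")
    best_k, best_bend = ks[1], float("-inf")
    for prev, cur, nxt in zip(ks, ks[1:], ks[2:]):
        bend = (sse_by_k[prev] - sse_by_k[cur]) - (sse_by_k[cur] - sse_by_k[nxt])
        if bend > best_bend:
            best_k, best_bend = cur, bend
    return best_k
```

In practice one would call kmeans for a range of K values on the normalised EPC attributes, collect the returned SSE values in a dictionary, and pass it to elbow to obtain a candidate K.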
   Association rules. One of the most powerful exploratory                    of statistics and generally difficult to interpret, since the geo-
techniques in data mining aiming at finding interesting correla-              localized EPC data lend themselves very well to be visualized
tions among data is represented by association rule discovery [1].            on maps, INDICE proposes several techniques to explore and
An association rule is expressed in the form A → B, where A and               visualize the knowledge extracted from EPCs.
B are disjoint and non-empty itemsets (i.e., A ∩ B = ∅). A is also            The dashboards include (i) geospatial maps, including traditional
called rule antecedent and B rule consequent. Since association               maps such as choropleth and scatter maps and a new type of map
rules extraction operates on a transactional dataset of categorical           named cluster-marker map, (ii) frequency distribution plots, (iii) as-
attributes, a discretization step is needed to convert the original           sociation rules, and (iv) correlation matrices. These visualization
continuously-valued measurements into categorical bins. The                   techniques are jointly exploited by INDICE to graphically show
discretization adopted in INDICE is described in [11]. The                    the extracted knowledge at different spatial granularity levels
technique involves creating a decision CART (Classification And               such as city, district, neighbourhood, or housing unit (e.g., cer-
Regression Tree) [2] for each variable, using as response variable            tificates belonging to the same building).
the annual primary energy demand normalized on the floor area.                    Geospatial maps. In INDICE, three geospatial maps have
The tree splits are used as bins in the discretization process. To            been integrated: (i) choropleth maps, (ii) scatter maps, and (iii)
select only a subset of interesting rules, constraints on various             cluster-marker maps. These energy maps are related to each other,
goodness measures are used. INDICE includes four well-known                   as each user can switch from one view to another, simply by
quality indices: (i) support, (ii) confidence, (iii) lift, and (iv) conviction.   changing the analysis zoom (i.e., drill down in the energy map)
The rule support is the percentage of transactions that contain               or introducing the knowledge of the cluster-markers. In choro-
both antecedent and consequent; confidence is the conditional                 pleth maps each area (at different zoom levels) is colored ac-
probability that the consequent is true under the condition of the            cording to the average value of the considered variable for the
antecedent; lift [30] measures the correlation between the an-                area under analysis. The scatter maps report a point and its
tecedent and the consequent; conviction [3] measures the degree               corresponding value for each EPC (and so residential unit) con-
of implication of a rule. Default thresholds are set by INDICE;               tained in the selected area. Cluster-marker maps, similarly to
however, the end-user can change them to analyze the extracted                the choropleth maps, aggregate multiple certificates coloring the
rules at different granularity levels.                                        dynamic markers according to the average of the values of the
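The four quality indices can be computed directly from their standard definitions, as in the sketch below. Support is returned as a fraction rather than a percentage, each transaction is assumed to be a set of discretized attribute labels, and the names support and rule_metrics are hypothetical.

```python
from typing import Dict, Sequence, Set


def support(transactions: Sequence[Set[str]], itemset: Set[str]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(transactions: Sequence[Set[str]],
                 antecedent: Set[str], consequent: Set[str]) -> Dict[str, float]:
    """Support, confidence, lift and conviction of the rule antecedent -> consequent."""
    if not antecedent or not consequent or antecedent & consequent:
        raise ValueError("antecedent and consequent must be non-empty and disjoint")
    supp_a = support(transactions, antecedent)
    supp_b = support(transactions, consequent)
    if supp_a == 0 or supp_b == 0:
        raise ValueError("antecedent and consequent must both occur in the data")
    supp_ab = support(transactions, antecedent | consequent)
    confidence = supp_ab / supp_a            # P(B | A)
    lift = confidence / supp_b               # > 1 indicates positive correlation
    conviction = (float("inf") if confidence == 1.0
                  else (1.0 - supp_b) / (1.0 - confidence))
    return {"support": supp_ab, "confidence": confidence,
            "lift": lift, "conviction": conviction}
```

A rule would then be kept only if each index meets its threshold; raising the thresholds yields fewer, stronger rules, while lowering them exposes finer-grained ones.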
                                                                              aggregated points. While the first two geospatial maps (i.e., choro-
2.3     Informative dashboard                                                 pleth and scatter maps) are useful for analyzing single variables,
                                                                              the cluster-marker visualization addresses the problem of representing
                                                                              multiple variables at the same time. Specifically, exploring
                                                                              a single variable at coarse granularity levels could lead to flat
                                                                              and poorly representative maps. To this end, INDICE includes
                                                                              cluster-markers to introduce a new feature to the maps, in order
                                                                              to analyze the energy efficiency of several buildings through
                                                                              various attributes. The cardinality of the corresponding cluster
affects the size of the marker and is reported inside the marker.
These maps are used together, providing in a single solution
different levels of detail depending on the zoom level selected
by the user. Figure 2 shows examples of analysis results at different
granularity levels, visualizing various information features
on the maps. In the upper part of Figure 2, a set of attributes
(i.e., the Average U-value of the vertical opaque envelope and the
Average U-value of the windows, see Section 3 for further attribute
details) extracted from the EPCs by means of the querying engine
is displayed. The choropleth map shows the average value
of the attributes for the selected area together with the scatter
marker of each single point, visualized at neighbourhood and
housing unit zoom levels, respectively. Users can navigate
the map and check the attribute values for each certificate by
clicking on the markers. In the bottom part of Figure 2, the information
obtained through the data analytics engine (e.g., the
identification of the areas characterized by lower and medium
energy performance) is visualized at district (Left) and
city (Right) levels. The cluster-markers show the cardinality of
each cluster, together with the average value of an independent
response variable chosen in the analytic process.

Figure 2: Example of choropleth and scatter map at single
certificate (Upper Left) and neighbourhood level (Upper
Right), and Cluster-marker maps at district (Bottom Left)
and city levels (Bottom Right).

   The aim of this component is to visualize the information
and the extracted knowledge and to make them easy to interpret at
different levels of detail. To this end, INDICE includes interactive
and navigable dashboards tailored to different use cases, providing
both domain-specific information and high-level energy
demand overviews. Indeed, the dashboards can be customized for
each end-user, providing deep targeted knowledge for domain
experts and human-readable informative contents for non-expert
users. Besides displaying charts and diagrams, which are typical

   Frequency distribution plots. For a given area, the frequency
distributions (e.g., quartiles or deciles) of the features selected
for the visualization task are reported. A frequency distribution
of data can be shown in a table or in a graph or diagram. Some common
methods include frequency tables, histograms or bar charts.
These distributions can refer to single attributes or to aggregate
information extracted from the analytic task, hence to groups of
similar certificates according to the subsets of attributes selected
for the analysis. INDICE provides a settings panel to select one or
more distribution visualizations, including the description of the
main statistical indices. For numeric attributes, INDICE reports the
count, mean, standard deviation and the three quartiles (i.e., first
quartile, median and third quartile), while for categorical attributes,
the count, the frequency of the most common value (i.e., the mode) and
the top-k frequent values are reported. The end-user can select a response
variable against which to color the attribute distributions.
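
The statistical indices listed above can be sketched with the Python standard library alone. The snippet below is only an illustration, not the INDICE implementation: the function names `describe_numeric` and `describe_categorical`, the sample values, the use of the sample standard deviation and the choice of the inclusive quartile method are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean, quantiles, stdev


def describe_numeric(values):
    """Summarize a numeric attribute: count, mean, std and the three quartiles."""
    # Assumption: quartiles use the "inclusive" interpolation method and
    # stdev is the sample standard deviation; the paper specifies neither.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": mean(values),
        "std": stdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def describe_categorical(values, k=3):
    """Summarize a categorical attribute: count, mode, its frequency, top-k values."""
    counts = Counter(values)
    mode, mode_frequency = counts.most_common(1)[0]
    return {
        "count": len(values),
        "mode": mode,
        "mode_frequency": mode_frequency,
        "top_k": counts.most_common(k),
    }


if __name__ == "__main__":
    # Hypothetical window U-values and energy classes, for illustration only.
    u_windows = [1.4, 2.1, 2.3, 2.9, 3.6]
    energy_classes = ["E", "F", "E", "G", "E", "F"]
    print(describe_numeric(u_windows))
    print(describe_categorical(energy_classes, k=2))
```

Both helpers return plain dictionaries, so the same summary could feed a table or the legend of a distribution plot.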
   Association rules. INDICE discovers correlations in terms
of association rules. To ease the manual inspection of
the most interesting correlations, INDICE defines templates to
characterize the attributes and represents the association rules
in a tabular visualization. By sorting on quality indices, only
the top-k rules that satisfy all constraints can be displayed. Rules
can be extracted at different granularity levels, e.g., for each city
or neighbourhood, or downstream of the clustering algorithm.

Figure 3: Correlation matrix between pairs of numerical
attributes
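
As a rough illustration of the top-k selection described in the Association rules paragraph, the following sketch filters a list of already mined rules and ranks the survivors by a quality index. It is not the INDICE code: the `Rule` class, the `top_k_rules` and `as_table` helpers, the choice of support, confidence and lift as quality indices, and the sample rules are all assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A mined association rule with hypothetical quality indices."""
    antecedent: tuple  # e.g. ("Uo=High", "Uw=High")
    consequent: str    # e.g. "EPH=High"
    support: float
    confidence: float
    lift: float


def top_k_rules(rules, k, min_support=0.0, min_confidence=0.0, sort_by="lift"):
    """Keep rules meeting every constraint, then return the k best by sort_by."""
    kept = [
        r for r in rules
        if r.support >= min_support and r.confidence >= min_confidence
    ]
    kept.sort(key=lambda r: getattr(r, sort_by), reverse=True)
    return kept[:k]


def as_table(rules):
    """Render rules as plain-text rows, one rule per line."""
    header = f"{'antecedent':<24}{'consequent':<12}{'supp':>6}{'conf':>6}{'lift':>6}"
    rows = [
        f"{' & '.join(r.antecedent):<24}{r.consequent:<12}"
        f"{r.support:>6.2f}{r.confidence:>6.2f}{r.lift:>6.2f}"
        for r in rules
    ]
    return "\n".join([header, *rows])


if __name__ == "__main__":
    # Made-up rules for illustration; they are not results from the paper.
    mined = [
        Rule(("Uw=High",), "EPH=High", 0.20, 0.70, 1.8),
        Rule(("Uo=High", "Uw=High"), "EPH=High", 0.10, 0.85, 2.2),
        Rule(("ETAH=High",), "EPH=Low", 0.02, 0.90, 2.5),
    ]
    print(as_table(top_k_rules(mined, k=2, min_support=0.05)))
```

Filtering before sorting keeps the ranking restricted to rules that meet every threshold, so a rare rule with a high lift cannot crowd out better supported ones.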
   Correlation matrices. To reduce the complexity of the analysis
and remove correlated attributes from the analytic process,
INDICE proposes correlation matrices to analyze the dependence
between variables. For each pair of numerical attributes X and Y,
the framework computes the Pearson correlation coefficient [28],
defined as $\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}$,
where $\mathrm{cov}(X,Y)$ is the covariance between X and Y, $\sigma_X$
is the standard deviation of X and, analogously, $\sigma_Y$ that of Y.
Each coefficient value is translated into a gray
level on a black-to-white scale to represent the correlation
intensity in a plot matrix. When the selected set of attributes has
no evident linear correlation, it is eligible for the analytic task.

3     PRELIMINARY EXPERIMENTAL RESULTS
INDICE has been experimentally evaluated on a real collection
of building energy performance certificates. The EPCs were issued
between 2016 and 2018 for buildings and flats located
in Piedmont, a major Italian region. This dataset has been
collected and openly released by CSI Piemonte (the Information
System Consortium)2 and regulated by the Piedmont Region authority
(Sustainable Energy Development Sector). The dataset
includes approximately 25000 energy certificates, each one characterized
by 132 features, including energy and thermo-physical
attributes, divided into 89 categorical attributes and 43 quantitative
attributes.

2 http://www.csipiemonte.it/web/it/

INDICE has been developed in Python [29], using the scikit-learn
library [25] (for the analytic tasks) and the folium library [13]
(for visualization purposes).

3.1     Case study
To evaluate the effectiveness of INDICE, we focus on a case study
having the public administration (PA) as stakeholder. The results
are obtained by tailoring the analysis to the city of Turin
and selecting the EPCs related to the housing units of type E.1.1
(buildings used as permanent residence). To clean the geospatial
information of each EPC (specifically the address, house number,
ZIP code, latitude and longitude), INDICE applies the algorithm
proposed in Section 2. This algorithm compares the addresses
in the EPC dataset with the addresses in an open dataset3 provided
by the municipality of Turin, containing the city roads,
with street names, house numbers, ZIP code and geolocation
(i.e., (latitude, longitude)). This database was used to verify the
reliability of the addresses in our dataset. In our case study, if
the PA user is interested in discovering which areas of a city are
more energy consuming and which are more efficient, she could
select the following subset of attributes, which characterize the
thermo-physical properties of each building: Aspect Ratio (S/V),
Average U-value of the vertical opaque envelope (Uo), Average
U-value of the windows (Uw), Heat surface (Sr) and Average global
efficiency for space heating (ETAH). The Aspect Ratio represents
the geometric shape of a building. Uo and Uw measure the heat
loss through the opaque and the transparent elements of the
building, respectively. The lower the thermal transmittance of
the building envelope, the lower the heat flow that is transmitted
through the elements themselves. The Heat surface corresponds
to the heated floor area. Lastly, the ETAH index takes into account
all the thermal losses of each subsystem, including the
generation, distribution, emission and control subsystems. The
PA user may be interested in discovering groups of buildings
with homogeneous thermo-physical properties. To address this
task, the K-means clustering algorithm can be applied.

3 https://www.sciamlab.com/opendatahub/dataset/c_l219_260
4 The discretization used for the dynamic dashboard is as follows: 4 classes for
the Average U-value of the windows (i.e., Low = [1.1, 2.05], Medium = (2.05, 2.45],
High = (2.45, 3.35] and Very high = (3.35, 5.5]); 3 classes for the Average U-value
of the vertical opaque envelope (i.e., Low = [0.15, 0.45], Medium = (0.45, 0.65] and
High = (0.65, 1.1]); 3 classes for the Average global efficiency for space heating
(i.e., Low = [0.20, 0.60], Medium = (0.60, 0.80] and High = (0.80, 1.1]).

   Before clustering, the correlation between the considered numerical
attributes is checked. In Figure 3, the correlation plot
matrix between the considered attribute pairs is reported. Dark
squares represent high linear correlation between the two variables,
while light squares represent low correlation. All the variables
considered in the analysis are weakly correlated (i.e., there
is no evident linear association between variable pairs). Hence,
the results obtained from the five attributes selected for the clustering
phase (i.e., S/V, Uo, Uw, Sr and ETAH) and the response
variable Normalized primary heating energy consumption (EPH)
allow the extraction of non-trivial knowledge from data. Figure
4 shows the results obtained by the data analytics engine for
the features described above. From the charts reported in the
dashboard, the analyst can explore the frequency distribution of
a specific attribute, such as the response variable EPH, or its distribution
in the cluster set detected by INDICE. Moreover, interesting
correlation rules4 can be extracted and visualized using a tabular
representation. In this way, every end user, regardless of
her level of expertise, can detect the attributes which most influence
the energy performance of buildings and find out the geographical
areas for which a certain set of rules applies. Driven
by the extracted knowledge, the PA user may support and incentivize
renovation policies targeting specific low-performance
neighbourhoods, or identifying groups of similar EPCs.

Figure 4: Interactive dashboard visualizing, at district level, the result of the data analytics engine.

4   CONCLUSIONS AND FUTURE WORKS
This paper presents INDICE, a new data visualization framework
that analyzes EPC collections at different granularity levels. After
a preprocessing step, INDICE extracts interesting and hidden
knowledge for different end-users. Informative dynamic dashboards
have been presented to show useful information at different
geospatial levels and with enriched map representations
(e.g., the cluster-marker map).
   As future work we plan to integrate in INDICE other analytic
techniques (both supervised and unsupervised) to provide a more
flexible and enhanced analysis. Furthermore, the analysis process
should be empowered by an automatic tool suggesting appropriate
analysis configurations for the considered datasets. To this
end, we are currently planning to release our framework INDICE
in order to collect real feedback from end-users (e.g., citizens,
energy experts, public administration). In this way, we could improve
the choices of the default configurations, but also include
and integrate further representations to improve the visualization
of the extracted knowledge.

Acknowledgments
The research leading to these results has been supported by the
SmartData@PoliTO center for Big Data and Machine Learning
technologies.
   The authors express their gratitude to Giovanni Nuvoli (Settore
Sviluppo Energetico Sostenibile - Regione Piemonte) and to
CSI Piemonte.

REFERENCES
[1] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. 1993. Mining association
    rules between sets of items in large databases. In Acm sigmod record.
    ACM, 207–216.
[2] Leo Breiman. 2017. Classification and regression trees. Routledge.
[3] Sergey Brin, Rajeev Motwani, Jeffrey D Ullman, and Shalom Tsur. 1997. Dynamic
    itemset counting and implication rules for market basket data. Acm
    Sigmod Record 26, 2 (1997), 255–264.
[4] Alfonso Capozzoli, Daniele Grassi, Marco Savino Piscitelli, and Gianluca Serale.
    2015. Discovering Knowledge from a Residential Building Stock through Data
    Mining Analysis for Engineering Sustainability. Energy Procedia 83 (2015),
    370–379. https://doi.org/10.1016/j.egypro.2015.12.212
[5] Tania Cerquitelli, Gianfranco Chicco, Evelina Di Corso, Francesco Ventura,
    Giuseppe Montesano, Mirko Armiento, Alicia Mateo González, and Andrea
    Veiga Santiago. 2018. Clustering-Based Assessment of Residential Consumers
    from Hourly-Metered Data. In 2018 International Conference on Smart
    Energy Systems and Technologies (SEST). IEEE, 1–6.
[6] Tania Cerquitelli, Gianfranco Chicco, Evelina Di Corso, Francesco Ventura,
    Giuseppe Montesano, Anita Del Pizzo, Alicia Mateo González, and
    Eduardo Martin Sobrino. 2018. Discovering electricity consumption over
    time for residential consumers through cluster analysis. In 2018 International
    Conference on Development and Application Systems (DAS). IEEE, 164–169.
[7] Tania Cerquitelli and Evelina Di Corso. 2016. Characterizing Thermal Energy
    Consumption through Exploratory Data Mining Algorithms. In EDBT/ICDT
    Workshops.
[8] Georgios Chatzigeorgakidis, Dimitrios Skoutas, Kostas Patroumpas, Spiros
    Athanasiou, and Spiros Skiadopoulos. 2018. Map-Based Visual Exploration
    of Geolocated Time Series. In Proceedings of the Workshops of the EDBT/ICDT
    2018 Joint Conference (EDBT/ICDT 2018), Vienna, Austria, March 26, 2018. 92–99.
    http://ceur-ws.org/Vol-2083/paper-14.pdf
[9] Giuliano Dall’O, Luca Sarto, Nicola Sanna, Valeria Tonetti, and Martina Ventura.
    2015. On the use of an energy certification database to create indicators
    for energy planning purposes: Application in northern Italy. Energy Policy 85,
    C (2015), 207–217.
[10] Evelina Di Corso, Tania Cerquitelli, and Daniele Apiletti. 2018. METATECH:
     METeorological Data Analysis for Thermal Energy CHaracterization by Means
     of Self-Learning Transparent Models. Energies 11, 6 (2018), 1336.
[11] Evelina Di Corso, Tania Cerquitelli, Marco Savino Piscitelli, and Alfonso
     Capozzoli. 2017. Exploring Energy Certificates of Buildings through Unsupervised
     Data Mining Techniques. In Internet of Things (iThings) and IEEE
     Green Computing and Communications (GreenCom) and IEEE Cyber, Physical
     and Social Computing (CPSCom) and IEEE Smart Data (SmartData), 2017 IEEE
     International Conference on. IEEE, 991–998.
[12] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A
     density-based algorithm for discovering clusters in large spatial databases
     with noise. In Kdd. 226–231.
[13] Filipe and Martin Journois et al. 2018. python-visualization/folium: v0.6.0.
     (Aug. 2018). https://doi.org/10.5281/zenodo.1344457
[14] Xiaohong Guan, Zhanbo Xu, and Qing-Shan Jia. 2010. Energy-efficient buildings
     facilitated by microgrid. IEEE Transactions on smart grid 1, 3 (2010),
     243–252.
[15] Frank R Hampel. 1974. The influence curve and its role in robust estimation.
     Journal of the american statistical association 69, 346 (1974), 383–393.
[16] Boris Iglewicz and David Caster Hoaglin. 1993. How to detect and handle
     outliers. Vol. 16. Asq Press.
[17] Tim Johansson, Mattias Vesterlund, Thomas Olofsson, and Jan Dahl. 2016.
     Energy performance certificates and 3-dimensional city models as a means to
     reach national targets–A case study of the city of Kiruna. Energy Conversion
     and Management 116 (2016), 42–57.
[18] B.-H. Juang and L.R. Rabiner. 1990. The segmental K-means algorithm for esti-
     mating parameters of hidden Markov models. IEEE Transactions on Acoustics,
     Speech and Signal Processing 38, 9 (Sep 1990), 1639–1641.
[19] Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions,
     insertions, and reversals. In Soviet physics doklady. 707–710.
[20] Xue Li, Lanshun Nie, and Shuo Chen. 2014. Approximate Dynamic Program-
     ming Based Data Center Resource Dynamic Scheduling for Energy Optimiza-
     tion. In IEEE iThings/GreenCom/CPSCom 2014, Taipei, Taiwan, September 1-3,
     2014. 494–501.
[21] Feng-Yi Lin, Tzu-Ping Lin, and Ruey-Lung Hwang. 2017. Using geospatial
     information and building energy simulation to construct urban residential
     energy use map with high resolution for Taiwan cities. Energy and Buildings
     157 (2017), 166–175.
[22] Sara Torabi Moghadam, Patrizia Lombardi, and Guglielmina Mutani. 2017. A
     mixed methodology for defining a new spatial decision analysis towards low
     carbon cities. Procedia Engineering 198 (2017), 375–385.
[23] Y Olivo, A Hamidi, and P Ramamurthy. 2017. Spatiotemporal variability in
     building energy use in New York City. Energy 141 (2017), 1393–1401.
[24] Maria-Evangelia Papadaki, Panagiotis Papadakos, Michalis Mountantonakis,
     and Yannis Tzitzikas. 2018. An Interactive 3D Visualization for the LOD
     Cloud. In Proceedings of the Workshops of the EDBT/ICDT 2018 Joint Conference
     (EDBT/ICDT 2018), Vienna, Austria, March 26, 2018. 100–103. http://ceur-ws.
     org/Vol-2083/paper-15.pdf
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M.
     Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D.
     Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn:
     Machine Learning in Python. Journal of Machine Learning Research 12 (2011),
     2825–2830.
[26] Iraci Miranda Pereira and Eleonora Sad de Assis. 2013. Urban energy con-
     sumption mapping for energy management. Energy Policy 59 (2013), 257–269.
[27] Bernard Rosner. 1983. Percentage points for a generalized ESD many-outlier
     procedure. Technometrics 25, 2 (1983), 165–172.
[28] Sheldon M Ross. 2014. Introduction to probability models. Academic press.
[29] Guido Rossum. 1995. Python Reference Manual. Technical Report. Amsterdam,
     The Netherlands.
[30] Pang-Ning Tan et al. 2007. Introduction to data mining. Pearson Education
     India.
[31] John W Tukey. 1977. Box-and-whisker plots. Exploratory data analysis (1977),
     39–43.
[32] Chao-Lin Wu, Wei-Chen Chen, Yi-Show Tseng, Li-Chen Fu, and Ching-Hu Lu.
     2014. Anticipatory Reasoning for a Proactive Context-Aware Energy Saving
     System. In IEEE iThings/GreenCom/CPSCom 2014, Taipei, Taiwan, September
     1-3, 2014. 228–234.