Exploring energy performance certificates through visualization

Tania Cerquitelli∗, Evelina Di Corso∗, Stefano Proto∗, Alfonso Capozzoli†, Fabio Bellotti∗, Maria G. Cassese∗, Elena Baralis∗, Marco Mellia‡, Silvia Casagrande§, Martina Tamburini§
∗ Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy
† Department of Energy, Politecnico di Torino, Torino, Italy
‡ Department of Electronics and Telecommunications, Politecnico di Torino, Torino, Italy
§ Edison Spa, Torino, Italy
∗ † ‡ name.surname@polito.it § name.surname@edison.it

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

ABSTRACT
Energy Performance Certificates (EPCs) provide interesting information on the standard-based calculation of energy performance and on the thermo-physical and geometrical properties of a building. Because of the volume of available data (issued as open data) and the heterogeneity of the attributes, the exploration of these energy-related data collections is challenging. This paper presents INDICE (INformative DynamiC dashboard Engine), a new data visualization framework able to automatically explore large collections of EPCs. INDICE explores EPCs through both querying and analytics tasks, and intuitively presents the output through informative dashboards. The latter include dynamic and interactive maps along with different informative charts, allowing different stakeholders (e.g., domain and non-domain expert users) to explore and interpret the extracted knowledge at different spatial granularity levels. The objective of INDICE is to create energy maps useful for the characterization of the energy performance of buildings located in different areas. The experimental evaluation, performed on a real set of EPCs related to a major Italian region in the North West of Italy, demonstrates the effectiveness of INDICE in exploring an EPC dataset through different data and knowledge visualization techniques.

1 INTRODUCTION
Nowadays large volumes of energy-related data are continuously collected in different domains. To reduce wasteful energy consumption, several orthogonal applications (e.g., buildings, IoT-based devices, wireless networks) increased their policy priority on energy efficiency. According to the U.S. Department of Energy, in industrialized countries more than 40% of total energy is consumed in buildings [14]. In the last few years many efforts have been devoted to improving building energy efficiency with different final goals: (i) facilitating proactive energy-saving services [32], (ii) characterizing data streams of energy consumption of individual residential consumers in buildings [5–7], (iii) characterizing heating energy demand through the analysis of energy performance certificates of buildings [4, 9, 11], and (iv) reducing emissions and energy consumption for buildings [20].

To enhance the effectiveness of data and knowledge exploration, a variety of data visualization techniques have been proposed. In [22, 23, 26] the authors exploited choropleth maps to analyze the energy consumption and the electricity consumption per unit area, respectively. Instead, in [21], the authors used dynamic simulations of building energy consumption and building information to develop urban energy maps with high spatial resolutions. However, all the above works proposed static maps to analyze the average values of some features of interest. The exploitation of dynamic and navigable maps tailored to the analysis of energy-related data has not been proposed so far. The authors in [24] propose an interactive 3D visualization to analyze the Linking Open Data (LOD) cloud adopting the metaphor of an urban area. The visualization is interactive, meaning that the user can enlarge any part of the model, modify the perspective, change the shape of the buildings and their positioning, and view all the connections or only those belonging to a specific data set. A parallel research effort has been devoted to exploring and summarizing geolocated time series data through maps [8]. Moreover, a great research effort has been made in [17], in which the authors propose a city energy model based on the requests and need for visualization from a group of energy consultants. Their proposed model offers stakeholders a powerful tool for evaluating both the current state and future scenarios.

This paper presents INDICE (INformative DynamiC dashboard Engine), a data visualization framework generating interactive and navigable dashboards through the analysis of a set of Energy Performance Certificates (EPCs). An EPC is a legal requirement when constructing, selling or renting a building, and it provides interesting information on the calculated standard energy performance and on the thermo-physical and geometrical properties of existing buildings. The multi-tiered framework INDICE has been proposed to effectively deal with large collections of EPCs. With respect to the other works, our framework brings together many different analysis techniques to help non-expert users make sense of Energy Performance Certificates. Indeed, after a pre-processing step, cluster analysis allows discovering groups of EPCs with similar features. To summarize the energy performance of buildings at different granularities, INDICE generates informative dashboards tailored to different energy stakeholders, combining both a rich set of interesting knowledge and ease of use.

The proposed informative dashboards exploit different kinds of energy maps to show data and knowledge at different spatial granularity levels.
The proposed visualization techniques allow attributes, INDICE includes a multi-step algorithm to correctly different energy stakeholders to easily capture the high-level reconstruct and correct the wrong information. Specifically, it overview of heating energy demand at a city level, and drill- compares the available addresses with a referenced street map down the knowledge to the single apartment. Moreover, in order that is usually available for each city. The referenced street map to analyze the energy efficiency of different buildings through should contain all the detailed information on streets, including the most interesting attributes under analysis, INDICE includes street names, house numbers, ZIP Code and geolocation (i.e., cluster-markers which dealing with the problem of representing latitude and longitude). Given a city under analysis, INDICE au- multiple variables at the same time. tomatically downloads the referenced street map if it is available As a case study, a real collection of EPCs related to a major online. Italian region, in North West Italy, was analyzed. Preliminary The referenced street map is exploited by INDICE to verify the experimental results show that the proposed approach is effective reliability of the addresses in the dataset under analysis to correct in visualizing a manageable set of human-readable knowledge errors in the address field and at the same time reconstruct miss- for each end-user thought dynamic and interactive maps. ing or incorrect information in the attributes ZIP Code, house The next sections of the paper are organized as follows. Sec- address, latitude and longitude. Specifically, the developed algo- tion 2 introduces an overview of the INDICE system with a thor- rithm compares each string in the dataset with the ones in the ough description of its main building blocks. Section 3 discusses referenced street map. 
For each couple of addresses Levenshtein the preliminary experimental results obtained on a real data col- distance [19] is computed to evaluate the similarity between two lection and Section 4 draws conclusions and presents the future character strings, in terms of the minimum number of modi- development of this work. fications (insertions, deletions and substitutions) necessary to transform the first string into the second one. The similarity 2 THE INDICE ANALYTICS SYSTEM computed from Levenshtein distance takes values in the range [0-1], where 0 indicates total dissimilarity and 1 equality of the compared strings. Given a user-defined threshold ϕ, the refer- enced address (the most similar to the address under analysis) replaces the original one if Levenshtein similarity between the two addresses is greater than or equal to ϕ. When the association to a referenced address is not possible, i.e., Levenshtein simi- larities are below ϕ, a geocoding request is sent via the Google Geocoding APIs1 . The latter is a reliable service providing a tex- tual address to reconstruct the whole address in a consistent way. However, INDICE exploits the Google Geocoding service only when the association cannot be resolved through the referenced Figure 1: The INDICE framework street map due to a limit on the number of free requests. 2.1.2 Outlier detection and removal. INDICE (INformative DynamiC dashboard Engine) has been An outlier is an extreme value that deviates from other obser- tailored to analyze any collection of EPCs. The analysis of this vations on data. It may occur either when the collected value kind of data is challenging, due to the large number of attributes does not fit the model under study or when some error happens characterizing each energy performance certificate. The exploita- during the data collection phase. 
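The address matching of Section 2.1.1 (Levenshtein similarity against a referenced street map, with a threshold ϕ) can be sketched as follows. This is a minimal illustration, not INDICE's code: the function names, the default `phi=0.8`, and the normalisation of the distance by the length of the longer string are assumptions, since the text only states that the similarity lies in [0, 1].

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_address(raw: str, street_map: list, phi: float = 0.8):
    """Return the most similar reference address, or None if it scores below phi.

    None signals that a fallback (a geocoding request in the paper) is needed.
    street_map must be a non-empty list of reference address strings.
    """
    raw = raw.strip().lower()
    best = max(street_map, key=lambda ref: similarity(raw, ref.lower()))
    return best if similarity(raw, best.lower()) >= phi else None
```

For instance, `match_address("Corso Duca degli Abruzi", ["Corso Duca degli Abruzzi", "Via Roma"])` resolves the misspelling to the first reference entry, while an address with no close counterpart returns `None`.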
To address this issue, INDICE ex- tion of this high dimensional data is burdensome due to the high ploits three approaches: (i) univariate outlier detection, (ii) mixed variability and dimensionality of data. INDICE combines different univariate analysis, and (iii) multivariate outlier detection. In- techniques to effectively visualize a rich set of knowledge items dependently of the above adopted strategies, values labelled as for a variety of energy stakeholders. The overall architecture is outliers are not considered in the subsequent steps of analysis. shown in Figure 1. INDICE includes three main building blocks, Univariate outlier. INDICE integrates three methodologies each one addressing one of the main steps of the knowledge- to automatically detect outliers and remove them for the subse- extraction process: (i) Data pre-processing, (ii) Data selection and quent analytics steps: (i) the graphic boxplot method, (ii) the para- analytics, and (iii) Data and Knowledge visualization. In the fol- metric generalized Extreme Studentized Deviate (gESD) method lowing, a detailed description of each building block is given. [27] and (iii) the non-parametric Median Absolute Deviation (MAD) [15]. The boxplot [31] (aka whiskers plot) is a conve- 2.1 Data Pre-processing nient way of visually displaying a data distribution through its The INDICE pre-processing phase aims at smoothing the effect of quartiles. The frequency distribution of each variable is summed possibly unreliable data. It performs two tasks which have been up through a few numbers (i.e. median, quartiles, min and max proved to be crucial in real-world geospatial data: (i) geospatial values). The median summarizes the central tendency of the dis- coordinates cleaning and (ii) outlier detection and removal. tribution, while the quartiles give an indication of the variability through the interquartile difference. The minimum and maxi- 2.1.1 Geospatial data cleaning. 
mum values provide not only information about extremes but This pre-processing step is crucial when the final aim is to dis- also on the possible presence of data with abnormal characteris- play data and knowledge through maps. INDICE includes an tics w.r.t. the other points, plotting them individually. For each ad-hoc strategy to clean geospatial attributes, including address, variable, the analyst can manually remove the outliers (i.e., the house number, ZIP Code, latitude and longitude. Since the ad- values smaller and greater than the minimum and the maximum) dress attribute is usually collected as a free text field, it often through value filters. contains numerous typos and input errors, which require care- ful analysis to be correctly fixed. To clean the above-mentioned 1 https://developers.google.com/maps/documentation/geocoding/intro The gESD method [27] is used to detect one or more outliers data mining algorithms (e.g., cluster analysis, association rules) in a univariate data set. This test needs a parameter which is data have to be properly transformed and peculiar attributes the upper bound on the number of potential outliers. INDICE have to be selected. Several techniques have been used to reduce tests the null hypothesis that the data has no outliers versus the the complexity of the datasets under analysis and discover effec- alternative hypothesis that there are at most k outliers (for some tive and hidden knowledge, interesting and readable by all the user specified value of k). Given the upper bound, k, the gESD different stakeholders involved in the analysis. This component test essentially performs k separate tests: a test for one outlier, a includes two innovative engines: (i) the query engine and (ii) the test for two outliers, and so on up to k outliers. In INDICE the data analytics engine. number of outliers is determined by finding the largest value 2.2.1 Querying engine. 
r (with r ≤ k), such that the corresponding test gives a value To select and explore the dataset under analysis, INDICE im- higher than the critical one. plements a query engine that lets the user focus on the single Lastly, in statistics the MAD method[15] is a robust measure of attributes of the energy performance certificates. Possible stake- the variability of a univariate sample of quantitative data. Calcu- holders may be citizens, public administration and energy scien- lating the MAD is straightforward, as it only involves finding the tists. Each of them could be interested in different characteris- median of absolute deviations from the median. It is calculated tics of the dataset under analysis. For each stakeholder, INDICE by taking the absolute difference between each point and the produces the best possible representation to highlight the main corresponding median, and then calculating the median of those interesting facets of the results. Citizens could be interested in the differences. As proposed in [16], INDICE uses the score of 3.5 as energy analysis of the buildings related to a specific area of the cut-off value. This means that every point with a score above 3.5 city, or in the geometric features that characterize the buildings is considered an outlier. belonging to the same intended use. The citizens may want to The users can exploit all the different univariate methodologies discover areas of the city with more performing buildings, to buy and/or choose the most suitable one. If a non-expert user does a flat that performs well in terms of energy efficiency. The public not know how to deal with these outlier detection techniques, administration may be instead being interested in identifying ar- she can use default configurations, as described below. eas where to promote and invest for energy renovations. Energy Expert-driven univariate analysis. 
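The MAD-based univariate filter with the 3.5 cut-off can be sketched as below. The text does not spell out how the score is computed; the sketch assumes the commonly used modified z-score, 0.6745 · |x − median| / MAD, so the constant and the function name are assumptions rather than INDICE's implementation.

```python
from statistics import median


def mad_outliers(values, cutoff=3.5):
    """Return the values whose modified z-score exceeds the cutoff."""
    med = median(values)
    # MAD: median of the absolute deviations from the median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median: nothing can be scored
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]
```

On a toy sample such as `[1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 9.0]` only the last value is flagged, because the median and the MAD are barely affected by the extreme point.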
Because some non- scientists could use INDICE to explore and characterize through expert users may be interested in analysing EPC collections, supervised and unsupervised techniques groups of building with INDICE suggests the univariate outlier detection method mostly similar properties to perform benchmarking analysis. Based on used by domain experts in the past interactions with INDICE. the target of each stakeholder, the system is able to automatically Specifically, by collecting and storing expert user (e.g., energy propose to the specific end-user an optimal set of interesting scientists) INDICE configurations, the non-expert users can re- reports and graphical representations, with the possibility to set ceive interesting and effective suggestions to properly deal with manually the subset of features and parameters for the queries noisy data. In the current version of INDICE, only relevant at- to which she is interested in. tributes describing the building thermo-physical characteristics (e.g., Aspect Ratio, Average U-value of the vertical opaque enve- 2.2.2 Data analytics engine. lope and Average U-value of the windows) and the efficiency of To extract meaningful and interesting knowledge items from data, the heating subsystems (e.g., Distribution Subsystem Efficiency INDICE includes different supervised and exploratory algorithms and Generation Subsystem Efficiency) have been considered. In to automatically analyze feature subsets. INDICE integrates the this way, if a non-expert user does not know which univariate K-means clustering algorithm [18] to create groups of buildings analysis technique should be used, she can use a configuration with similar thermo-physical and energy properties, and associ- adopted by previous INDICE expert users, since their choices ation rule mining [1] to extract interesting correlations among are automatically stored as default configurations for non-expert features. users. K-means algorithm. 
The partitional K-means cluster algo- Multivariate outlier detection. For the multivariate outlier rithm [18] is exploited by INDICE to identify groups of EPCs detection, INDICE integrates the DBSCAN algorithm (Density- characterized by similar properties. To measure the similarity be- Based Spatial Clustering of Application with Noise) [12] to auto- tween EPCs, the Euclidean distance is computed. The K-means al- matically identify outliers. Specifically, DBSCAN detects clusters gorithm, which is the most popular clustering algorithm, divides based on a density reachability concept, where clusters with the input dataset into K groups, where K is defined a-priori. The higher-density regions are separated by lower-density regions. average of all the energy certificates in each cluster represents DBSCAN requires two user-defined parameters (i.e., minPoints the centroid (representative point) of each group of buildings. and Epsilon). To properly specify these input parameters INDICE First, the algorithm chooses randomly K initial centroids. Then, plots the k-distance graph and automatically estimates a good each point is assigned to the closest centroid and the centroids are value for each parameter. As proposed in [10], INDICE runs sev- recalculated. The previous steps are repeated until the centroids eral times the k-distance plot for different values of minPoints, no longer change. K-means is able to identify a good cluster set and selects minPoints when the curve stabilises, and Espilon as in a limited computational time. INDICE analyses the trend of the elbow point of the stable curve. the SSE (saReadum of squared error) quality index to evaluate the cluster cohesion [30] and automatically identify possible good K values. 
The SSE is computed as the total sum of squared errors 2.2 Data selection and analytics for all objects in the collection, where for each object the error is The knowledge visualization step is preceded by a data selection computed as the squared distance from the closest centroid. As and analytics phase. Since each energy performance certificate done in [30], in INDICE the K value is chosen as the point where includes a large number of features characterized by a great vari- the marginal decrease in the SSE curve is maximized (aka elbow ability, in order to extract accessible knowledge and implement approach). Association rules. One of the most powerful exploratory of statistics and generally difficult to interpret, since the geo- techniques in data mining aiming at finding interesting correla- localized EPC data lend themselves very well to be visualized tions among data is represented by association rule discovery [1]. on maps, INDICE proposes several techniques to explore and An association rule is expressed in the form A → B, where A and visualize the knowledge extracted from EPCs. B are disjoint and non-empty itemsets, (i.e., A ∩ B = ∅). A is also The dashboards include (i) geospatial maps, including traditional called rule antecedent and B rule consequent. Since association maps as choropleth and scatter maps and a new type of map rules extraction operates on a transactional dataset of categorical named cluster-marker map, (ii) frequency distribution plots, (iii) as- attributes, a discretization step is needed to convert the original sociation rules, and (iv) correlation matrices. These visualization continuously-valued measurements into categorical bins. The techniques are jointly exploited by INDICE to graphically show discretization adopted in INDICE are described in [11]. 
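The SSE index and the elbow choice of K can be sketched as follows. The SSE function follows the definition in the text directly. The elbow rule is only one possible reading of "the point where the marginal decrease in the SSE curve is maximized": here it is taken as the K with the largest second difference of the curve, which is an assumption. Function names are hypothetical.

```python
def sse(points, centroids):
    """Sum over all points of the squared Euclidean distance to the closest centroid."""
    return sum(
        min(sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids)
        for point in points
    )


def elbow(sse_by_k):
    """Pick K where the SSE curve bends most (largest second difference).

    sse_by_k maps consecutive K values to the SSE obtained with that K.
    Returns None when fewer than three K values are given.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        # Drop gained by moving to k, minus the drop gained by moving past k.
        bend = (sse_by_k[prev] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[nxt])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k
```

In practice the SSE values would come from running K-means for each candidate K (for example, scikit-learn exposes this quantity as the `inertia_` attribute of a fitted `KMeans`).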
The used the extracted knowledge at different spatial granularity levels technique involves creating a decision CART (Classification And such as city, district, neighbourhood, or housing unit (e.g., cer- Regression Tree) [2] for each variable, using as response variable tificates belonging to the same building). the annual primary energy demand normalized on the floor area. Geospatial maps. In INDICE, three geospatial maps have The tree splits are used as bins in the discretization process. To been integrated: (i) choropleth maps, (ii) scatter maps, and (iii) select only a subset of interesting rules, constraints on various cluster-marker maps. These energy maps are related to each other, goodness measures are used. INDICE includes four well-known as each user can switch from one view to another, simply by quality indices: i) support, ii) confidence, iii) lift, and iv) conviction. changing the analysis zoom (i.e., drill down in the energy map) The rule support is the percentage of transactions that contain or introducing the knowledge of the cluster-markers. In choro- both antecedent and consequent; confidence is the conditional pleth maps each area (at different zoom levels) is colored ac- probability that the consequent is true under the condition of the cording to the average value of the considered variable for the antecedent; lift [30] measures the correlation between the an- area under analysis. The scatter maps report a point and its tecedent and the consequent; conviction [3] measures the degree corresponding value for each EPC (and so residential unit) con- of implication of a rule. Default thresholds are set by INDICE tained in the selected area. Cluster-marker maps, similarly to however the end-user could change the default values to analyze the choropleth maps, aggregate multiple certificates coloring the at different granularity level the extracted rules. dynamic markers according to the average of the values of the aggregated points. 
While the first two geospatial maps (i.e., choro- 2.3 Informative dashboard pleth and scatter maps) are useful for analyzing single variables, the cluster-marker visualization faces the problem of represent- ing multiple variables at the same time. Specifically, exploring a single variable at coarse granularity levels could lead to flat and poor representative maps. To this extent, INDICE includes cluster-markers to introduce a new feature to the maps, in order to analyze the energy efficiency of several buildings through various attributes. The cardinality of the corresponding cluster affects the size of the marker and is reported inside the marker. These maps have been used together, ensuring in a single solution different levels of detail depending on the zoom degree selected by the user. Figure 2 shows examples of analysis results at differ- ent granularity levels, visualizing various information features on the maps. In the upper part of Figure 2, a set of attributes (i.e., the Average U-value of the vertical opaque envelope and the Average U-value of the windows, see Section 3 for further attribute details) extracted from the EPCs by means of the querying engine has been displayed. The choropleth map shows the average value of the attributes for the selected area together with the scatter marker of each single point, visualized at neighbourhood and housing unit zoom levels, respectively. The users can navigate Figure 2: Example of choropleth and scatter map at single the map and check the attribute values for each certificate by certificate (Upper Left) and neighbourhood level (Upper clicking on the markers. In the bottom part of Figure 2, the in- Right), and Cluster-marker maps at district (Bottom Left) formation obtained through the data analytics engine (e.g., the and city levels (Bottom Right). identification of the areas characterized by lower and medium energy performances) has been visualized at district (Left) and city (Right) levels. 
The cluster-markers show the cardinality of The aim of this component is to visualize and make the infor- each cluster, together with the average value of an independent mation and the extracted knowledge easy to be interpreted at response variable chosen in the analytic process. different levels of detail. To this extent, INDICE includes interac- Frequency distribution plots. For a given area, the frequency tive and navigable dashboards tailored to different use cases, pro- distributions (e.g., quartiles or deciles) of the features selected viding both domain specific information and high-level energy for the visualization task are reported. A frequency distribution demand overviews. Indeed, the dashboards can be customized for of data can be shown in a table or graph/diagrams. Some com- each end-user, providing deep targeted knowledge for domain mon methods include frequency tables, histograms or bar charts. experts and human-readable informative contents for non-expert These distributions can refer to single attributes or to aggregate users. Besides displaying charts and diagrams, which are typical information extracted from the analytic task, hence to groups of similar certificates according to the subsets of attributes selected for the analysis. INDICE provides a setting panel to select one or more distribution visualizations, including the description of the main statistical indices. For numeric data, INDICE includes count, mean, standard deviation and the three quartiles (i.e., median, first and third quartiles), while for categorical attributes, the count, the most common value’s frequency (i.e., mode) and the top-k frequent values are reported. The end-user can select a response variable against which to color the attribute distributions. Association rules. INDICE discovers correlations in terms of association rules. 
However, to ease the manual inspection of the most interesting correlations, INDICE defines templates to characterize the attributes and represent the association rules using a tabular visualization. By sorting on quality indices, only Figure 3: Correlation matrix between pairs of numerical the top-k rules that satisfy all constraints may be displayed. Rules attributes can be extracted at different granularity levels, e.g., for each city, neighbourhood or downstream of the clustering algorithm. Correlation matrices. To reduce the complexity of the anal- the PA user is interested in discovering which areas of a city are ysis and remove correlated attributes from the analytic process, more energy consuming and which are more efficient, she could INDICE proposes correlation matrices to analyze the dependence select the following subset of attributes, which characterize the between variables. For each pair of numerical attributes X and Y, thermo-physical properties of each building: Aspect Ratio (S/V), the framework computes the Pearson correlation coefficient [28], cov(X,Y ) Average U-value of the vertical opaque envelope (Uo ), Average U- defined as ρ X,Y = σX σY where cov(X , Y ) is the covariance value of the windows (Uw ), Heat surface (Sr ) and Average global between X and Y , σX is the standard deviation of X and analo- efficiency for space heating (ETAH). The Aspect Ratio represents gously σY for Y . Each coefficient value is translated into a gray the geometric shape of a building. Uo and Uw measure the heat level in the black-and-white scale to represent the correlation loss through the opaque and the transparent elements of the intensity in a plot matrix. When the selected set of attributes has building, respectively. The lower the thermal transmittance of no evident linear correlation, it is eligible for the analytic task. the building envelope, the lower the heat flow that is transmitted through the elements themselves. 
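The Pearson coefficient behind the correlation matrices, and its translation into a gray level, can be sketched as follows. The coefficient follows the definition in the text (covariance divided by the product of the standard deviations). The gray-level mapping is an assumption: it uses the absolute value of the coefficient so that dark squares mean strong linear correlation, as described for Figure 3, but the exact scale used by INDICE is not stated.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / (sigma_x * sigma_y)


def gray_level(rho):
    """Assumed mapping: 0.0 is black (|rho| = 1), 1.0 is white (rho = 0)."""
    return 1.0 - abs(rho)
```

Applying `pearson` to every pair of selected attributes yields the matrix; attributes whose pairwise squares are all light are the ones the text considers eligible for the analytic task.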
The Heat surface corresponds 3 PRELIMINARY EXPERIMENTAL RESULTS to the heated floor area. Lastly, the ETAH index takes into ac- INDICE has been experimentally evaluated on a real collection count all the thermal losses of each subsystem, including the of building energy performance certificates. The EPCs are issued generation, distribution, emission and control subsystems. The in the years between 2016 and 2018 for buildings and flats lo- PA user may be interested in discovering groups of buildings cated in Piedmont, a major Italian region. This dataset has been with homogeneous thermo-physical properties. To address this collected and openly released by CSI Piemonte (the Information task the K-means clustering algorithm can be applied. System Consortium)2 and regulated by the Piedmont Region au- Before clustering, the correlation between the considered nu- thority (Sustainable Energy Development Sector). The dataset merical attributes is checked. In Figure 3, the correlation plot includes approximately 25000 energy certificates, each one char- matrix between the considered attribute pairs is reported. Dark acterized by 132 features, including energy and thermo-physical squares represent high linear correlation between the two vari- attributes, divided into 89 categorical attributes and 43 quantita- ables, while light squares represent low correlation. All the vari- tive attributes. ables considered in the analysis are weakly correlated (i.e., there INDICE has been developed in Python [29], including the scikit- is no evident linear association between variable pairs). Hence, learn library [25] (for the analytic tasks) and folium library [13] the results obtained from the five attributes selected for the clus- (for visualization purposes). tering phase (i.e., S/V, Uo , Uw , Sr and ETAH) and the response variable Normalized primary heating energy consumption (EPH), 3.1 Case study allow the extraction of non-trivial knowledge from data. 
To evaluate the effectiveness of INDICE, we focus on a case study having as stakeholder the public administration (PA). The results are obtained by tailoring the analysis to the city of Turin and selecting the EPCs related to the housing units of type E.1.1 (buildings used as permanent residence). To clean the geospatial coordinates, specifically the address, house number, ZIP Code, latitude and longitude for each EPC, INDICE applies the algorithm proposed in Section 2. This algorithm compares the addresses in the EPC dataset and the addresses in an open dataset³ provided by the municipality of Turin, containing the city roads, with street names, house numbers, ZIP Code and geolocation (i.e., (latitude, longitude)). This database was used to verify the reliability of the addresses in our dataset. In our case study, if

Figure 4 shows the results obtained by the data analytics engine for the features described above. From the charts reported in the dashboard, the analyst can explore the frequency distribution of a specific attribute, such as the response variable EPH, or its distribution in the cluster set detected by INDICE. Moreover, interesting correlation rules⁴ can be extracted and visualized using a tabular representation. In this way, every end user, independently of her expertise degree, can detect the attributes which influence most the energy performance of buildings and find out the geographical areas for which a certain set of rules apply. Driven by the extracted knowledge, the PA user may support and incentivize renovation policies targeting specific low performance neighborhoods, or identify groups of similar EPCs.

Figure 4: Interactive dashboard visualizing, at district level, the result of the data analytics engine.

² http://www.csipiemonte.it/web/it/
³ https://www.sciamlab.com/opendatahub/dataset/c_l219_260
⁴ The discretization used for the dynamic dashboard is as follows. 4 classes for the Average U-value of the windows (i.e., Low = [1.1, 2.05], Medium = (2.05, 2.45], High = (2.45, 3.35] and Very high = (3.35, 5.5]); 3 classes for the Average U-value of vertical opaque envelope (i.e., Low = [0.15, 0.45], Medium = (0.45, 0.65], High = (0.65, 1.1]); 3 classes for the Average global efficiency for space heating (i.e., Low = [0.20, 0.60], Medium = (0.60, 0.80], High = (0.80, 1.1]).

4 CONCLUSIONS AND FUTURE WORKS

This paper presents INDICE, a new data visualization framework that analyzes EPC collections at different granularity levels. After a preprocessing step, INDICE extracts interesting and hidden knowledge for different end-users. Informative dynamic dashboards have been presented to show useful information, at different geospatial levels and with enriched map representations (e.g., the cluster-marker map).

As future work we plan to integrate in INDICE other analytics techniques (both supervised and unsupervised) to provide a more flexible and enhanced analysis. Furthermore, the analysis process should be empowered by an automatic tool suggesting appropriate analysis configurations for the considered datasets. To this aim, we are currently planning to release our framework INDICE in order to have real feedback from end-users (e.g., citizens, energy experts, public administration). In this way, we could improve the choices of the default configurations, but also include and integrate further representations to improve the visualization of the extracted knowledge.

Acknowledgments

The research leading to these results has been supported by the SmartData@PoliTO center for Big Data and Machine Learning technologies.

The authors express their gratitude to Giovanni Nuvoli (Settore Sviluppo Energetico Sostenibile - Regione Piemonte) and to CSI Piemonte.

REFERENCES

[1] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. 1993. Mining association rules between sets of items in large databases. In Acm sigmod record. ACM, 207–216.
[2] Leo Breiman. 2017. Classification and regression trees. Routledge.
[3] Sergey Brin, Rajeev Motwani, Jeffrey D Ullman, and Shalom Tsur. 1997. Dynamic itemset counting and implication rules for market basket data. Acm Sigmod Record 26, 2 (1997), 255–264.
[4] Alfonso Capozzoli, Daniele Grassi, Marco Savino Piscitelli, and Gianluca Serale. 2015. Discovering Knowledge from a Residential Building Stock through Data Mining Analysis for Engineering Sustainability. Energy Procedia 83 (2015), 370–379. https://doi.org/10.1016/j.egypro.2015.12.212
[5] Tania Cerquitelli, Gianfranco Chicco, Evelina Di Corso, Francesco Ventura, Giuseppe Montesano, Mirko Armiento, Alicia Mateo González, and Andrea Veiga Santiago. 2018. Clustering-Based Assessment of Residential Consumers from Hourly-Metered Data. In 2018 International Conference on Smart Energy Systems and Technologies (SEST). IEEE, 1–6.
[6] Tania Cerquitelli, Gianfranco Chicco, Evelina Di Corso, Francesco Ventura, Giuseppe Montesano, Anita Del Pizzo, Alicia Mateo González, and Eduardo Martin Sobrino. 2018. Discovering electricity consumption over time for residential consumers through cluster analysis. In 2018 International Conference on Development and Application Systems (DAS). IEEE, 164–169.
[7] Tania Cerquitelli and Evelina Di Corso. 2016. Characterizing Thermal Energy Consumption through Exploratory Data Mining Algorithms. In EDBT/ICDT Workshops.
[8] Georgios Chatzigeorgakidis, Dimitrios Skoutas, Kostas Patroumpas, Spiros Athanasiou, and Spiros Skiadopoulos. 2018. Map-Based Visual Exploration of Geolocated Time Series. In Proceedings of the Workshops of the EDBT/ICDT 2018 Joint Conference (EDBT/ICDT 2018), Vienna, Austria, March 26, 2018. 92–99. http://ceur-ws.org/Vol-2083/paper-14.pdf
[9] Giuliano Dall’O, Luca Sarto, Nicola Sanna, Valeria Tonetti, and Martina Ventura. 2015. On the use of an energy certification database to create indicators for energy planning purposes: Application in northern Italy. Energy Policy 85, C (2015), 207–217.
[10] Evelina Di Corso, Tania Cerquitelli, and Daniele Apiletti. 2018. METATECH: METeorological Data Analysis for Thermal Energy CHaracterization by Means of Self-Learning Transparent Models. Energies 11, 6 (2018), 1336.
[11] Evelina Di Corso, Tania Cerquitelli, Marco Savino Piscitelli, and Alfonso Capozzoli. 2017. Exploring Energy Certificates of Buildings through Unsupervised Data Mining Techniques. In Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), 2017 IEEE International Conference on. IEEE, 991–998.
[12] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd. 226–231.
[13] Filipe and Martin Journois et al. 2018. python-visualization/folium: v0.6.0. (Aug. 2018). https://doi.org/10.5281/zenodo.1344457
[14] Xiaohong Guan, Zhanbo Xu, and Qing-Shan Jia. 2010. Energy-efficient buildings facilitated by microgrid. IEEE Transactions on smart grid 1, 3 (2010), 243–252.
[15] Frank R Hampel. 1974. The influence curve and its role in robust estimation. Journal of the american statistical association 69, 346 (1974), 383–393.
[16] Boris Iglewicz and David Caster Hoaglin. 1993. How to detect and handle outliers. Vol. 16. Asq Press.
[17] Tim Johansson, Mattias Vesterlund, Thomas Olofsson, and Jan Dahl. 2016. Energy performance certificates and 3-dimensional city models as a means to reach national targets–A case study of the city of Kiruna. Energy Conversion and Management 116 (2016), 42–57.
[18] B.-H. Juang and L.R. Rabiner. 1990. The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing 38, 9 (Sep 1990), 1639–1641.
[19] Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady. 707–710.
[20] Xue Li, Lanshun Nie, and Shuo Chen. 2014. Approximate Dynamic Programming Based Data Center Resource Dynamic Scheduling for Energy Optimization. In IEEE iThings/GreenCom/CPSCom 2014, Taipei, Taiwan, September 1-3, 2014. 494–501.
[21] Feng-Yi Lin, Tzu-Ping Lin, and Ruey-Lung Hwang. 2017. Using geospatial information and building energy simulation to construct urban residential energy use map with high resolution for Taiwan cities. Energy and Buildings 157 (2017), 166–175.
[22] Sara Torabi Moghadam, Patrizia Lombardi, and Guglielmina Mutani. 2017. A mixed methodology for defining a new spatial decision analysis towards low carbon cities. Procedia Engineering 198 (2017), 375–385.
[23] Y Olivo, A Hamidi, and P Ramamurthy. 2017. Spatiotemporal variability in building energy use in New York City. Energy 141 (2017), 1393–1401.
[24] Maria-Evangelia Papadaki, Panagiotis Papadakos, Michalis Mountantonakis, and Yannis Tzitzikas. 2018. An Interactive 3D Visualization for the LOD Cloud. In Proceedings of the Workshops of the EDBT/ICDT 2018 Joint Conference (EDBT/ICDT 2018), Vienna, Austria, March 26, 2018. 100–103. http://ceur-ws.org/Vol-2083/paper-15.pdf
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[26] Iraci Miranda Pereira and Eleonora Sad de Assis. 2013. Urban energy consumption mapping for energy management. Energy Policy 59 (2013), 257–269.
[27] Bernard Rosner. 1983. Percentage points for a generalized ESD many-outlier procedure. Technometrics 25, 2 (1983), 165–172.
[28] Sheldon M Ross. 2014. Introduction to probability models. Academic press.
[29] Guido Rossum. 1995. Python Reference Manual. Technical Report. Amsterdam, The Netherlands.
[30] Pang-Ning Tan et al. 2007. Introduction to data mining. Pearson Education India.
[31] John W Tukey. 1977. Box-and-whisker plots. Exploratory data analysis (1977), 39–43.
[32] Chao-Lin Wu, Wei-Chen Chen, Yi-Show Tseng, Li-Chen Fu, and Ching-Hu Lu. 2014. Anticipatory Reasoning for a Proactive Context-Aware Energy Saving System. In IEEE iThings/GreenCom/CPSCom 2014, Taipei, Taiwan, September 1-3, 2014. 228–234.