<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Characterizing thermal energy consumption through exploratory data mining algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tania Cerquitelli</string-name>
        </contrib>
        <aff>Politecnico di Torino - Torino, Italy</aff>
        <email>name.surname}@polito.it</email>
      </contrib-group>
      <fpage>16</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>Nowadays large volumes of energy data are continuously collected through a variety of meters from different smart-city environments. Such data have a great potential to influence the overall energy balance of our communities by optimizing building energy consumption and by enhancing people's awareness of energy waste. This paper presents FARTEC, a data mining engine based on exploratory and unsupervised data mining algorithms to characterize building energy consumption together with meteorological conditions. FARTEC exploits a joint approach coupling cluster analysis and association rules. First, a partitional clustering algorithm is applied to weather conditions to discover groups of thermal energy consumption that occurred in similar weather conditions. Each computed cluster is then locally characterized through a set of association rules to ease the manual inspection of the most interesting correlations between thermal consumption and weather conditions. FARTEC also includes a categorization of the rules into a few groups according to their meaning. Each group is determined by the data features appearing in the rule. The experimental evaluation performed on real datasets demonstrates the effectiveness of the proposed approach in discovering interesting knowledge items to raise people's awareness of their energy consumption.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Nowadays the demand for energy in the main urban
sectors is driven by human activities and by people's
awareness of wasting energy. It is challenging to increase people's
awareness and persuade them to pursue energy-saving
behaviours but it is fundamental to have a positive impact on
the global energy balance. Many research activities have
been carried out to use database technologies and
statistical tools to store and analyze energy data to evaluate the
efficiency of buildings. Research contributions on
energy-related data have been carried out for: (i) supporting data
visualization and warning notification [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]; (ii) efficient
storing and retrieval operations based on NoSQL databases [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ];
(iii) characterizing building consumption [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
consumption profiles among different users [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Data mining emerged
during the late 1980s and focused on studying algorithms to
find implicit, previously unknown, and potentially useful
information from large volumes of data. Data mining activities
include studying correlations among data (e.g., association
rules at different levels of abstraction), grouping data with
similar properties (e.g., clustering), and extracting
information for prediction (e.g., classification, regression). The
first two classes of algorithms are the most interesting ones
for their exploratory nature, as they do not require a priori
knowledge (such as the target class to be predicted), thus
supporting different and interesting targeted analyses. The
exploitation of these approaches on energy-related data is of
paramount importance to bring interesting, actionable, and
hidden knowledge to the surface.
      </p>
      <p>
        This paper presents an exploratory data mining engine,
named FARTEC (From Association Rules To Energy
Consumption), targeted at energy-related data. FARTEC
analyzes energy data collections enriched with meteorological
data through a two-level methodology based on cluster
analysis and association rules. The clustering analysis allows the
discovery of groups of thermal energy consumption that
occurred with similar weather conditions. Each cluster is then
locally characterized by a set of interesting patterns to
summarize cluster content and to highlight correlations among
thermal energy consumption and meteorological conditions.
Specifically, FARTEC includes the K-means algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to
cluster weather data while using the association rule miner
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to model correlations among energy data and
meteorological conditions. A categorization of rules into a few reference
classes according to their meaning has also been proposed.
As a case study, FARTEC has been validated on real
energy consumption data collected in a major Italian city. These
data have been integrated with meteorological data.
Preliminary experimental results show that the proposed
approach is effective in discovering interesting correlations to
raise people's awareness of their energy consumption.
      </p>
      <p>In this paper, Section 2 introduces an overview of the
FARTEC system, while a thorough description of its main
components is presented in Section 3. Section 4 discusses
the preliminary experimental results obtained on real data,
and Section 5 draws conclusions and presents the future
development of this work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. THE FARTEC SYSTEM</title>
      <p>
        Figure 1 shows the overall architecture of the FARTEC
system, which collects, integrates, characterizes, and analyzes
energy-related data to make people aware of their energy and
thermal consumption and to encourage them to
pursue energy-saving strategies. FARTEC includes four main
components, named Data collection and integration, Data
preprocessing, Knowledge extraction, and Knowledge
visualization. These components are briefly described below and a
more detailed description is given in Section 3. In FARTEC
the Data collection and integration component stores
measurements on energy consumption every 5 minutes and
aggregates them into hourly thermal energy consumption. These
data are enriched with spatial and temporal information at
different abstraction levels as well as with various hourly
meteorological conditions. The enriched dataset is stored
in a data warehouse as proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Different phases of
Data preprocessing are then performed to prepare data for
the subsequent analysis. The Knowledge extraction
component discovers groups of energy consumption levels
associated with similar meteorological conditions as well as
correlations among thermal energy consumption and
meteorological conditions. Discovered correlations, in the form
of association rules [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], are categorized into a few reference
classes according to their meaning. Lastly, the Knowledge
visualization component shows user-friendly plots to
summarize building performance over time.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. THE FARTEC COMPONENTS</title>
      <p>
        The analysis process in FARTEC is applied on data as
modeled in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Thus, the Data collection and integration
component collects thermal energy consumption, roughly
every 5 minutes, from a large number of smart meters
deployed in a major Italian city, and aggregates them every
hour. As proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], these data are enriched with
temporal information at different granularity levels as well
as with various meteorological conditions available as open
data sources. Weather data include temperature, relative
humidity, precipitation, wind direction, UV index, solar
radiation and atmospheric pressure. In this paper we mainly
focus on the exploitation of exploratory and unsupervised
data mining algorithms to characterize energy consumption
at different coarse granularities. Different criteria can be
exploited to select only a portion of the data (e.g., daily energy
consumption in a winter season) stored in the data warehouse
to address a targeted analysis. The FARTEC components,
addressing the main phases of the analysis process, are
described in the next sections.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Data preprocessing</title>
      <p>Extracting actionable knowledge from data is a multi-step
process. The knowledge extraction phase is preceded by a
preprocessing phase, which aims to smooth the effect of
possibly unreliable measurements. Preprocessing entails the
following steps: (i) outlier detection and removal, (ii) missing
value handling, and (iii) correlation analysis.</p>
      <p>Outlier detection and removal. An outlier is an
observation that lies outside the expected range of values. It
may occur either when a measurement does not fit the model
under study or when a measurement error happens (e.g.,
faulty sensors may provide unacceptable measurements for
the thermal energy consumption). To address this issue,
FARTEC exploits the boxplot (also known as box-and-whisker plot)
to graphically show groups of numerical data through their
quartiles. The boxplot sums up the data distribution through a
few numbers (i.e., median, quartiles, minimum and maximum values)
modeling the frequency distribution. The median
summarizes the central tendency of the distribution and, compared
to the quartiles, provides information about the asymmetry of
the distribution. The quartiles give an indication of the
variability through the interquartile range. The extremes
not only provide information on the maximum and
minimum values but also on the possible presence of data with
abnormal characteristics, which are plotted as individual points.</p>
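      <p>The boxplot-based screening described above can be sketched as follows. This is only an illustrative sketch: the 1.5 x IQR whisker length is the common boxplot default and is an assumption here, since the text does not state which fence is used; the function name and the humidity readings are hypothetical.</p>
      <preformat><![CDATA[
```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Split values into (kept, outliers) using boxplot fences.

    Assumption: fences are placed at Q1 - whisker*IQR and Q3 + whisker*IQR,
    with whisker=1.5 (the usual boxplot convention, not stated in the text).
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if not (low <= v <= high)]
    return kept, outliers


# Hypothetical relative-humidity readings (%); 250.0 mimics a faulty sensor.
humidity = [55.0, 60.0, 62.0, 70.0, 71.0, 75.0, 80.0, 85.0, 250.0]
kept, outliers = boxplot_outliers(humidity)
```
]]></preformat>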
      <p>Missing value handling is an important step that
significantly affects the mining process. Since we focus on the
characterization of thermal energy consumption, we only
consider data records where the corresponding consumption
value is available. However, FARTEC exploits two
strategies to handle missing values on other considered features
(e.g., meteorological data): (i) replace them with the daily
average value or (ii) replace them with the hourly average
value computed in the last week. The choice is mainly driven
by the physical meaning of each attribute. For example, case
(i) is exploited for the precipitation and wind direction
attributes, while case (ii) is for the solar radiation and UV
index attributes.</p>
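      <p>The two replacement strategies can be illustrated with a minimal sketch. The data layout (a dictionary keyed by integer day index and hour) and the function names are assumptions made only for this example; they are not taken from the FARTEC implementation.</p>
      <preformat><![CDATA[
```python
from statistics import mean


def fill_with_daily_mean(readings):
    """Strategy (i): replace a missing value with the mean of the
    available values recorded on the same day.

    readings maps (day, hour) to a float or None (hypothetical layout).
    """
    filled = dict(readings)
    for (day, hour), value in readings.items():
        if value is None:
            same_day = [v for (d, _h), v in readings.items()
                        if d == day and v is not None]
            if same_day:
                filled[(day, hour)] = mean(same_day)
    return filled


def fill_with_hourly_mean_last_week(readings):
    """Strategy (ii): replace a missing value with the mean of the values
    recorded at the same hour during the previous seven days."""
    filled = dict(readings)
    for (day, hour), value in readings.items():
        if value is None:
            past = [readings.get((day - k, hour)) for k in range(1, 8)]
            past = [v for v in past if v is not None]
            if past:
                filled[(day, hour)] = mean(past)
    return filled


# Hypothetical precipitation values for day 0 (strategy i).
precipitation = {(0, 0): 10.0, (0, 1): None, (0, 2): 20.0}
precipitation_filled = fill_with_daily_mean(precipitation)

# Hypothetical solar radiation at noon over eight days (strategy ii).
solar = {(d, 12): 100.0 + d for d in range(7)}
solar[(7, 12)] = None
solar_filled = fill_with_hourly_mean_last_week(solar)
```
]]></preformat>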
      <p>Correlation analysis. Correlated attributes have
similar impact in the analysis process. Thus, they are usually
removed to reduce the space and time complexity of data
mining algorithms. FARTEC leverages the correlation
matrix to analyze the dependence between multiple variables
at the same time. The correlation coefficient between each
pair of variables is computed through the Pearson
correlation, defined as</p>
      <p>
        <disp-formula id="eq-1"><label>(1)</label><tex-math>\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}</tex-math></disp-formula>
      </p>
      <p>where cov(X, Y) is the covariance between X and Y, σ<sub>X</sub> is
the standard deviation of X, and analogously σ<sub>Y</sub> for Y.
Correlation coefficients are not influenced by the measurement
unit of the attributes. The higher the absolute value of the
coefficient, the stronger the correlation.</p>
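      <p>As an illustration, the Pearson coefficient and the resulting correlation matrix can be computed as in the following sketch. The attribute names and values are hypothetical, and population (rather than sample) statistics are assumed; the choice does not change the coefficient because the normalization cancels out.</p>
      <preformat><![CDATA[
```python
import math


def pearson(x, y):
    """Pearson correlation: cov(X, Y) / (std(X) * std(Y))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


def correlation_matrix(columns):
    """Nested dict with the coefficient for every pair of attributes."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical hourly values: consumption falls as temperature rises.
data = {
    "temperature_c": [2.0, 4.0, 6.0, 8.0],
    "temperature_f": [35.6, 39.2, 42.8, 46.4],  # same readings, other unit
    "consumption": [30.0, 25.0, 20.0, 15.0],
}
matrix = correlation_matrix(data)
```
]]></preformat>
      <p>In this toy example the coefficient between temperature and consumption does not depend on whether temperature is expressed in Celsius or Fahrenheit, which reflects the unit invariance noted above.</p>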
    </sec>
    <sec id="sec-5">
      <title>3.2 Knowledge extraction</title>
      <p>To extract meaningful and interesting knowledge items
from data while maintaining the number of extracted results
within manageable limits, the analysis should be performed
on the most interesting subsets of input data and the results
manually evaluated by a domain expert. Selecting specific
subsets from which interesting knowledge can be
independently derived is of paramount importance to bring hidden
knowledge to the surface. For this purpose, FARTEC
exploits a clustering algorithm to identify specific data subsets
from which interesting data correlations can be discovered.
Specifically, since energy consumption is strongly influenced
by weather conditions, the identification of energy
consumption records that occurred with similar weather conditions
reduces both the complexity of the correlation analysis and
the cardinality of the extracted rules to be manually
validated. FARTEC uses a clustering algorithm to partition
data into subsets. Before the clustering phase, the dataset
is normalized with the range transformation to [0, 1]. Each
cluster is then locally characterized by a set of association
rules to model the most interesting correlations among data.
FARTEC also includes a categorization of extracted rules
in a few groups to ease manual inspection by the domain
expert.</p>
      <sec id="sec-5-1">
        <title>3.2.1 Clustering</title>
        <p>
          Clustering algorithms divide data into groups/subsets
(clusters) so that objects within the same group are more similar
to each other than objects assigned to different groups [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
In FARTEC, groups are identified by analyzing records of
meteorological conditions and the distance between two
objects is computed with the Euclidean distance. The aim is to
discover records of energy consumption that occurred with
similar weather data. FARTEC integrates a partitional
algorithm, the K-means algorithm [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], to subdivide the input
dataset into K groups, where K is defined by the user and
each object is assigned to a single cluster. Each group is
represented by its centroid computed as the average of all
the objects in the cluster. First, the algorithm sets K
initial centroids, chosen randomly. Then each point is assigned
iteratively to the closest centroid. Next, the centroids are
recalculated. The algorithm repeats the previous steps until
the centroids no longer change. K-means is probably the
most popular clustering algorithm [
          <xref ref-type="bibr" rid="ref13 ref5">5, 13</xref>
          ], although it has
a bias towards clusters with a spherical shape. However, it
identifies a reasonably good cluster set in a limited
computational time. K-means requires the
number of clusters to be specified in advance, which is one
of its biggest drawbacks. To address this issue, FARTEC
analyzes the trend of the SSE quality index, and the
value of K is selected at the point where the
marginal decrease in the SSE curve is maximized. The SSE
index [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] measures the cluster quality in terms of cluster
cohesion. It is computed as the total sum of squared errors
for all objects in the collection, where for each object the
error is computed as the squared distance from the closest
centroid.
        </p>
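        <p>The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the RapidMiner implementation used in the experiments: the helper names, the toy weather points, and the rule used to locate the bend of the SSE curve (largest difference between consecutive SSE decreases) are assumptions introduced for this example.</p>
        <preformat><![CDATA[
```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means; returns (centroids, SSE)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K randomly chosen initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to the closest centroid
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        updated = [  # recompute each centroid as the mean of its cluster
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # stop when the centroids no longer change
            break
        centroids = updated
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


def best_sse(points, k, runs=10):
    """Lowest SSE over several random initializations."""
    return min(kmeans(points, k, seed=s)[1] for s in range(runs))


def elbow_k(sse_by_k):
    """Assumed heuristic: K where the SSE drop shrinks the most afterwards."""
    ks = sorted(sse_by_k)
    best_k, best_gain = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        gain = ((sse_by_k[prev_k] - sse_by_k[k])
                - (sse_by_k[k] - sse_by_k[next_k]))
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k


# Hypothetical weather records, already normalized to [0, 1].
weather = [(0.10, 0.10), (0.12, 0.08), (0.90, 0.10),
           (0.88, 0.12), (0.50, 0.90), (0.52, 0.88)]
sse_by_k = {k: best_sse(weather, k) for k in range(1, 6)}
suggested_k = elbow_k(sse_by_k)
```
]]></preformat>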
      </sec>
      <sec id="sec-5-2">
        <title>3.2.2 Association rules extraction</title>
        <p>FARTEC discovers correlations from the cluster set
identified by the K-means algorithm. Discovered correlations,
in terms of association rules, model interesting relationships
among the data under analysis. A transactional dataset D is
a set of transactions, each of which is a set of items (also
called an itemset). An item is represented in the form attribute
= value. Since we are interested in analyzing energy-related
data, each attribute may describe energy consumption,
meteorological data (e.g., wind direction, UV index), or temporal
data (e.g., daily time slot). Since association rule
mining requires a transactional dataset of categorical attributes,
FARTEC applies a discretization step to convert
continuously valued measurements into categorical bins. An
association rule is expressed in the form X → Y, where X and
Y are disjoint itemsets, i.e., X ∩ Y = ∅. X is also called the
rule antecedent and Y the rule consequent. The rule quality is
measured through two basic indices, named support (s) and
confidence (c). The rule support is the percentage of records
containing both X and Y. It represents the prior
probability of X ∪ Y (i.e., its observed frequency) in the dataset.
The rule confidence, instead, is the conditional probability
of finding Y given X.</p>
        <p>
          Given a set of transactions D, FARTEC finds all the
rules having support ≥ minsup and confidence ≥ minconf,
where minsup and minconf are the corresponding support
and confidence thresholds, which are user-specified
parameters. To rank the most interesting rules, FARTEC uses the
lift index [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which measures the (symmetric) correlation
between antecedent and consequent of the extracted rules.
When a rule has lift equal to one, the occurrence
probability of the antecedent and the consequent are independent,
so X and Y are not correlated. Lift values above 1 show
a positive correlation between itemsets X and Y, while
values below 1 indicate a negative correlation. FARTEC ranks
rules according to their lift value to focus on the subset of
most positively correlated rules.
        </p>
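        <p>The three indices can be computed directly from their definitions, as in the following sketch. The transactions and item labels are hypothetical and only mirror the attribute = value form described above; the function name is an assumption.</p>
        <preformat><![CDATA[
```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule X -> Y.

    support    = P(X and Y)
    confidence = P(Y | X)
    lift       = confidence / P(Y)
    """
    x, y = set(antecedent), set(consequent)
    n = len(transactions)
    count_x = sum(1 for t in transactions if x.issubset(t))
    count_y = sum(1 for t in transactions if y.issubset(t))
    count_xy = sum(1 for t in transactions if (x | y).issubset(t))
    if count_x == 0 or count_y == 0:
        raise ValueError("antecedent or consequent never occurs")
    support = count_xy / n
    confidence = count_xy / count_x
    lift = confidence / (count_y / n)
    return support, confidence, lift


# Hypothetical discretized records (one set of items per transaction).
transactions = [
    {"Temperature = cold", "Humidity = very high", "Consumption level = high"},
    {"Temperature = cold", "Humidity = high", "Consumption level = high"},
    {"Temperature = mild", "Humidity = high", "Consumption level = low"},
    {"Temperature = mild", "Humidity = low", "Consumption level = low"},
]
support, confidence, lift = rule_metrics(
    transactions, {"Temperature = cold"}, {"Consumption level = high"})
```
]]></preformat>
        <p>In this toy dataset the rule has support 0.5, confidence 1.0 and lift 2.0, i.e., a positive correlation between antecedent and consequent.</p>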
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.3 Association rule categorization</title>
      <p>FARTEC includes a categorization of the rules into a few
groups according to their meaning to ease the manual
inspection of the domain expert. The meaning of a rule is
determined by its template which includes the attributes
characterizing data. We defined three basic classes of rules
that progressively provide more detailed information.
Templates are summarized in Table 1, where a basic example
rule is reported for each of them.</p>
      <p>Specifically, the first template models the Correlations
among cluster and weather conditions included in it, as shown
in Table 1 at row T1. This template mainly focuses on the
weather conditions that characterize each cluster, without
considering the other aspects. We only consider 2-length
rules to extract the distinctive characteristics of the climatic
conditions of each cluster. This rule set is extracted from
the complete cluster set. At row T2 the template models the
Correlations among weather conditions included in the
cluster. This template models the cluster content based on the
most frequent weather conditions. This kind of rule is locally
extracted from each cluster content. The third template
at row T3 in Table 1 models the Correlations among
energy consumption level, time, and weather conditions. This
template models the correlation between weather conditions
and energy consumption level at a different time
granularity. This kind of rule is locally extracted from each cluster
content enriched with the energy consumption information.</p>
    </sec>
    <sec id="sec-7">
      <title>4. EXPERIMENTAL RESULTS</title>
      <p>
        We performed a preliminary analysis of energy
consumption on a real dataset, including energy consumption of 15
residential buildings, using the FARTEC engine. We
considered energy data related to a complete winter period from
October 15th, 2014 to April 15th, 2015. Data collected
through the smart meters are integrated with
meteorological information collected from the Weather Underground
web service [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which gathers data from Personal Weather
Stations (PWS) registered by users. These data are
analyzed for each building separately. We addressed four
issues: (i) outlier detection and correlation analysis (Section
4.1); (ii) cluster characterization in terms of data
distribution in each cluster (Section 4.2) and representative
association rules (Section 4.3); (iii) knowledge visualization
(Section 4.4); (iv) FARTEC sensitivity and robustness to
parameter setting (Section 4.5). Here we discuss a given
building which is representative of the group of buildings in
the considered dataset.
      </p>
      <p>
        Based on the experimental evaluation discussed in Section
3.2, the parameter setting (K=4, minconf=1%, minsup=1%,
minlift=1.1) has been used as the reference default
configuration for FARTEC. To address the problem of centroid
initialization for the K-means algorithm we performed multiple
runs, with randomly chosen initial centroids and the
number of iterations set to 20. The open source RapidMiner
toolkit [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has been used for the correlation analysis, cluster
analysis and association rule extraction.
MATLAB has been used to perform the analysis of data
distribution. Experiments were performed on a 2.66-GHz Intel(R)
Core(TM)2 Quad PC with 8 GBytes of main memory.
      </p>
    </sec>
    <sec id="sec-8">
      <title>4.1 Outlier detection and correlation analysis</title>
      <p>Here we discuss the preliminary results obtained for
the outlier detection and removal phase as well as
the correlation analysis step performed by FARTEC. Since
data collected from sensors are expected to be dirty,
collected measurements are analyzed one phenomenon at a time
through boxplots. Humidity measurements are discussed as
a representative example. Figure 2 shows the humidity
distribution of measurements related to a winter period
before and after outlier removal. The left part of the figure
shows the boxplot with the presence of outliers. The
plot highlights the presence of measurements that are not
compliant with valid humidity percentage values. To ease the manual
inspection of values outside the allowable range, the boxplot
shows outliers as individual points in the graph. Figure 2
(right) shows the humidity distribution in the absence of
values classified as outliers. The boxplot has the median value
close to 70%, and 50% of the data fall in the interquartile range
[55%, 85%].</p>
      <p>FARTEC exploits the correlation matrix to analyse the
dependence between multiple variables at the same time.
The correlation matrix shown in Table 5 contains the
correlation coefficients between each pair of attributes,
computed as discussed in Section 3.1. This matrix is
symmetric (i.e., the correlation of column i with column j is the
same as the correlation of column j with column i), and
its generic element (i, j) models the correlation between the
attribute in row i and the one in column j. Correlation
coefficients always lie in the range [−1, 1]. A positive value
(in (0, 1]) implies a positive correlation between attributes i
and j. Thus, large (small) values of attribute i tend to be
associated with large (small) values of attribute j. A
negative value (in [−1, 0)) means a negative or inverse association.
In this case large values of i tend to be associated with small
values of j and vice versa. A value near 0 indicates weakly
correlated data. Elements on the diagonal of the matrix are
always 1, since they represent the correlation of an attribute
with itself. The matrix shown in Table 5 has been computed
on data, available for a given building, of a complete winter
period. These results highlight two strong correlations: (1)
a strong positive correlation (0.967) between External
Temperature, i.e., the mean external temperature monitored
through PWS, and Mean Temperature, monitored through a
sensor deployed on the roof of the considered building; (2)
a high correlation, greater than 0.90, between UV
index and Solar Radiation. Since highly correlated attributes
are similar in behaviour, for each pair of attributes
highlighted in the matrix one is removed from the analysis to
reduce both the computational cost and the cardinality of
the extracted knowledge. Based on the above results, we do
not consider Mean Temperature and Solar Radiation in the
subsequent analysis process.</p>
    </sec>
    <sec id="sec-9">
      <title>4.2 Cluster characterization</title>
      <p>
        FARTEC exploits the cluster analysis to identify groups
of energy consumption that occurred in similar
meteorological conditions. The K-Means algorithm has been applied on
meteorological data related to a winter period. FARTEC
represents the cluster set through (i) the singular value
decomposition (SVD) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to show the results in a graphical
and friendly way; (ii) the comparison of boxplots (one for
each cluster) for each attribute separately.
      </p>
      <p>SVD is a matrix factorization method that factorizes the
input data matrix into three matrices. It can be easily
exploited to reduce the data dimensions by only considering
the most representative attributes. Figure 3 shows the SVD
of the cluster set discovered by K-means with
K=4. Since all clusters in Figure 3 are well-separated,
K-means is able to identify a good cluster set.</p>
      <p>Figure 4 shows the Humidity distribution in the four
discovered clusters. The set of clusters is characterized by both
positive and negative skewness and groups of observations
are quite different, i.e., Cluster 1 and Cluster 2 have quite
high median values while Cluster 3 and Cluster 4 have lower
median values. In case of positive skewness, observations
concentrate around the lowest values, while in the
case of negative skewness, the observations concentrate
around the highest ones. Cluster 1 and Cluster 2
have a negative skewness, (Q3 − Me) &lt; (Me − Q1), where
Me is the median, Q1 the first quartile and Q3 the third
quartile. Data are more concentrated between the median
and the third quartile, as the same percentage of
observations falls in a smaller range. These clusters have higher
relative humidity than Cluster 3, which instead has a
positive skewness due to the presence of lower relative humidity
values.</p>
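      <p>The quartile-based skewness check used above can be sketched as follows. The function name and the humidity values are hypothetical, and the "inclusive" quantile method is an assumption, since the text does not state how quartiles are computed.</p>
      <preformat><![CDATA[
```python
import statistics


def quartile_skew(values):
    """Classify skewness by comparing (Q3 - Me) with (Me - Q1)."""
    q1, me, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper, lower = q3 - me, me - q1
    if upper < lower:
        return "negative"   # data concentrated between Me and Q3
    if upper > lower:
        return "positive"   # data concentrated between Q1 and Me
    return "symmetric"


# Hypothetical relative-humidity values (%) of a humid cluster.
humid_cluster = [40.0, 60.0, 70.0, 78.0, 80.0, 82.0, 84.0, 85.0, 86.0]
skew = quartile_skew(humid_cluster)
```
]]></preformat>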
    </sec>
    <sec id="sec-10">
      <title>4.3 Analysis of extracted patterns</title>
      <p>
        Here we discuss the most interesting association rules
classified according to the rule templates presented in Section
3.3. Since association rule mining requires a transactional
dataset of categorical values, FARTEC performs a
discretization step to convert continuously valued measurements
into categorical bins. In our case study, we used fixed-size
discretized bins determined by a domain expert based on
their significance in the energy and meteorological context.
The fixed-size bins are defined as follows. (1)
Energy consumption per unit of volume (denoted as
consumption level): two bins until 15.5 kW/m<sup>3</sup> (off until 0.05
kW/m<sup>3</sup>, low until 15.5 kW/m<sup>3</sup>), one bin every 10 kW/m<sup>3</sup>
for values until 35.5 (medium consumption until 25.5, high
consumption until 35.5) and an additional bin for values
exceeding 35.5 kW/m<sup>3</sup>. (2) Humidity: one bin every 20% from 0
to 100%. (3) Temperature: values are discretized in five bins
(very cold up to 5 °C, cold up to 10 °C, mild up
to 18 °C, hot up to 25 °C, very hot up to 45
°C). (4) Temporal data: the timestamp is aggregated into the
corresponding daily time slot (e.g., morning, day, afternoon,
evening). Each day is classified as holiday or working, and
aggregated in week, fortnight, month, 2-month, 3-month, and
6-month time periods. (5) The remaining meteorological data have
been discretized based on meteorology criteria available in
[
        <xref ref-type="bibr" rid="ref9">9</xref>
          ]: precipitation level values and wind direction have
been categorized in eight bins each, the UV index in six
bins, and atmospheric pressure in two bins.
      </p>
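      <p>The discretization of temperature and consumption level can be sketched as follows, using the bin edges listed above. Treating each upper bound as inclusive and the helper names are assumptions made for this example; the label "very high" for values exceeding 35.5 follows the consumption levels that appear in the extracted rules.</p>
      <preformat><![CDATA[
```python
from bisect import bisect_left

# (upper bounds, labels): the last label covers values above the last bound.
TEMPERATURE_BINS = ([5, 10, 18, 25],
                    ["very cold", "cold", "mild", "hot", "very hot"])
CONSUMPTION_BINS = ([0.05, 15.5, 25.5, 35.5],
                    ["off", "low", "medium", "high", "very high"])


def discretize(value, bins):
    """Map a numeric value to its categorical bin label.

    Assumption: upper bounds are inclusive (e.g., exactly 5 is 'very cold').
    """
    upper_bounds, labels = bins
    return labels[bisect_left(upper_bounds, value)]


def to_items(temperature, consumption):
    """Build items in the attribute = value form used by the rule miner."""
    return {
        "Temperature = " + discretize(temperature, TEMPERATURE_BINS),
        "Consumption level = " + discretize(consumption, CONSUMPTION_BINS),
    }


# Hypothetical hourly record: 7.2 degrees Celsius and 28.0 kW/m3.
items = to_items(7.2, 28.0)
```
]]></preformat>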
      <p>Table 2 shows the top interesting rule (with the highest
lift value) characterizing each cluster according to the rst
template. These rules are extracted from the complete set
of energy consumption related to a given building enriched
with cluster labels. Rules R1–R4 identify the most
representative meteorological item in each cluster. Through the
second template, these weather items are subsequently
combined with other meteorological items to characterize each
cluster in more detail. R1–R4 include different
meteorological items to characterize each cluster, and this result further
highlights that the discovered groups are well-separated.</p>
      <p>Table 3 shows the most positively correlated rules
(R5–R8) summarizing each cluster content. These rules,
examples of the second template, show a strong correlation among
various meteorological features, and compactly model each
discovered cluster. For example, Cluster 1 includes
meteorological data related to cold days, while Cluster 4 regards
mild days. Specifically, Cluster 1 is characterized by very
high humidity, low pressure and rain, with the presence of
clouds and low UV index, while Cluster 4 is characterized
by mild temperatures, high pressure and light winds.</p>
      <p>Table 4 reports a subset of extracted rules according to
the third template. The rules, one for each energy
consumption level, are sorted by decreasing lift values. Rules R9 and
R12 highlight a high level of thermal energy consumption
together with various weather conditions. Specifically, the
former means that, during rainy days, the relative humidity of
the air tends to increase, as very high humidity and low
pressure imply the presence of clouds. Moreover, the south wind is a
very weak and moist wind, and cold temperature accentuates
the body's discomfort. Thus, the energy consumption level
is very high. Rules R10 and R13 instead characterize lower
thermal energy consumption. Specifically, rule R13 means
that the wind from the Southeast is a warm and moist wind,
the humidity is high and the temperature is mild. So the
thermal energy consumption level is negligible. The rule refers to October,
and the low consumption is also explained by the fact that
temperatures are not low, even in the evenings.</p>
      <p>According to the discussed set of patterns, the selected
building chosen as representative has a good thermal energy
consumption level which is in line with the meteorological
factors that influenced it.</p>
    </sec>
    <sec id="sec-11">
      <title>4.4 Summarizing and comparing energy consumption</title>
      <p>To enhance the user energy awareness of its energy
consumption, FARTEC summarizes the building energy
consumption levels over time grouped to similar meteorological
conditions. Di erent symbols and colors (see Figure 5, right)
are used for di erent energy consumption levels. Figure 5
shows the proposed graphical representation to simplify and
synthesize the energy consumption patterns (according to
the third template) over time in a compact, human-readable,
detailed and exhaustive model.</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th>CId</th><th>RId</th><th>Rule</th><th>Supp (%)</th><th>Conf (%)</th><th>Lift</th></tr>
          </thead>
          <tbody>
            <tr><td>C1</td><td>R9</td><td>{Fortnight = 16-31 January, Daily time slot = Evening, UV index = minimum, Humidity = very high, Temperature = cold, Pressure = low, Wind direction = South, Precipitations = drizzling} → {Consumption level = very high}</td><td>0.2</td><td>100.0</td><td>153.5</td></tr>
            <tr><td>C3</td><td>R10</td><td>{Fortnight = 1-15 April, Daily time slot = Evening, Precipitations = no rain, UV index = low, Pressure = high, Humidity = low, Temperature = warm, Wind direction = South} → {Consumption level = off}</td><td>0.5</td><td></td><td></td></tr>
            <tr><td>C2</td><td>R11</td><td>{Fortnight = 16-31 December, Daily time slot = Day, UV index = minimum, Precipitations = no rain, Pressure = high, Temperature = cold} → {Consumption level = medium}</td><td></td><td></td><td></td></tr>
            <tr><td>C1</td><td>R12</td><td>{Fortnight = 1-15 December, Daily time slot = Morning, UV index = minimum, Pressure = low, Humidity = very high, Temperature = cold, Wind direction = South} → {Consumption level = high}</td><td></td><td></td><td></td></tr>
            <tr><td>C4</td><td>R13</td><td>{Fortnight = 16-31 October, Daily time slot = Evening, Temperature = mild, Wind direction = Southeast, Humidity = high} → {Consumption level = low}</td><td></td><td></td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>This representation also
simplifies the comparison of thermal energy consumption
levels between two buildings. Figure 5 shows the graphs of
two of the four clusters discovered for the selected building. Each
graph reports the thermal energy consumption level for each
pair (daily time slot, fortnight). Specifically, for each
cluster, rules in the form of the third template are partitioned
by time slot and fortnight. Within each partition, the rule with the
highest lift value is selected, and the symbol associated with the
corresponding energy consumption level is reported in the
graph. Cluster 1 and Cluster 4 are discussed as
representative because they capture opposite weather conditions
(cold days versus mild days). The Cluster 1 graph
(Figure 5, left) includes a large number of symbols denoting
high average consumption levels. Indeed, on winter
mornings consumption is high due to the bad
weather conditions. In spring and autumn the
consumption level decreases, while in every month the
evenings are characterized by a medium consumption level.
In contrast, the Cluster 4 graph (Figure 5, center) is
characterized by lower consumption levels because this cluster
represents mild weather conditions. In spring and
autumn especially, consumption levels are low or negligible during the
day and afternoon time slots, while in winter low or
medium consumption levels occur on
some mild days.</p>
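      <p>The selection step described above (one rule per time slot and fortnight, chosen by highest lift) can be sketched as follows. This is a minimal illustration rather than the FARTEC implementation: the Rule record, the summarize function and the lift values are hypothetical and only mirror the rule attributes named in the text.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical record for a rule following the third template."""
    fortnight: str
    time_slot: str
    level: str   # consumption level in the rule head
    lift: float


def summarize(rules):
    """For each (time slot, fortnight) cell keep the level of the highest-lift rule."""
    best = {}
    for rule in rules:
        cell = (rule.time_slot, rule.fortnight)
        if cell not in best or rule.lift > best[cell].lift:
            best[cell] = rule
    return {cell: rule.level for cell, rule in best.items()}


# Illustrative rules of one cluster; the lift values are made up.
rules = [
    Rule("1-15 December", "Morning", "high", 3.2),
    Rule("1-15 December", "Morning", "medium", 1.4),
    Rule("16-31 October", "Evening", "low", 2.1),
]
grid = summarize(rules)
```
      </preformat>
      <p>Each entry of grid corresponds to one symbol in the graph of a cluster; mapping levels to symbols and colors is left out of the sketch.</p>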
      <p>The graphical model that FARTEC uses to display the
extracted knowledge also supports the side-by-side comparison of
energy consumption levels among different buildings. In the
presence of different behaviours, users can expand the
corresponding rules compactly represented in the graph. Table 6
shows a subset of rules comparing the energy efficiency
of the previously discussed building (bi) and a new one
(building bj). For example, the antecedent of R14 describes
bad weather conditions under which the two buildings show a different energy
efficiency: the former has a very high
energy consumption level, while the latter has a high one. This is due
to the different building sizes (6,297 m³ and 3,120 m³) and
the different behaviour of their occupants. Rule R17 instead shows an
example in which the consumption of building bi is far lower
than that of building bj. Since the fortnight corresponds to
the Christmas holidays, the people living in bi may have been
away on holiday and turned off the heating system.</p>
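      <p>A comparison of this kind can be sketched by lining up the per-cell consumption levels of two buildings and reporting the cells where they differ. The sketch below is an assumption-laden illustration: the ordering in LEVELS, the compare_buildings function and the two example grids are hypothetical and are not taken from Table 6.</p>
      <preformat>
```python
# Assumed ordering of the consumption levels, from lowest to highest.
LEVELS = ["off", "low", "medium", "high", "very high"]


def compare_buildings(grid_i, grid_j):
    """Return, for each shared cell, how many levels building i is above building j.

    Cells where the two buildings have the same level are omitted.
    """
    rank = {level: position for position, level in enumerate(LEVELS)}
    gaps = {}
    for cell in sorted(set(grid_i).intersection(grid_j)):
        gap = rank[grid_i[cell]] - rank[grid_j[cell]]
        if gap != 0:
            gaps[cell] = gap
    return gaps


# Hypothetical (time slot, fortnight) grids for two buildings.
grid_bi = {
    ("Evening", "16-31 January"): "very high",
    ("Day", "16-31 December"): "off",
    ("Morning", "1-15 December"): "high",
}
grid_bj = {
    ("Evening", "16-31 January"): "high",
    ("Day", "16-31 December"): "high",
    ("Morning", "1-15 December"): "high",
}
gaps = compare_buildings(grid_bi, grid_bj)
```
      </preformat>
      <p>A positive gap marks a cell where the first building consumes more, a negative gap one where it consumes less; these are the cells whose underlying rules a user would expand.</p>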
    </sec>
    <sec id="sec-12">
      <title>4.5 Analysis of parameter setting</title>
      <p>We analyzed the robustness of the FARTEC engine to
parameter settings for both phases of analysis (i.e., cluster
analysis and association rule mining). The K-means algorithm
requires as input the number of clusters (K), which
is in general very difficult to define, given the wide range
in which it may vary. To address this issue, we performed
many runs of the algorithm with varying values of K and,
for each run, evaluated the cluster set by computing the
SSE. Figure 6 shows the SSE value against the K
parameter. The smaller the SSE, the better the quality of the
discovered clusters. However, as the number of clusters increases,
the SSE decreases because smaller and more cohesive
clusters are identified. To identify a good trade-off between the
number of clusters and their significance, we selected K = 4,
corresponding to the maximization of the marginal decrease
in the SSE curve.</p>
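      <p>The two ingredients of this analysis can be sketched as follows: computing the SSE of a given clustering, and picking K from an SSE curve. The sketch assumes one common reading of the selection criterion, namely the K at which the marginal decrease of the SSE drops the most (the "elbow"); the function names and the SSE curve are hypothetical and do not reproduce Figure 6. The K-means runs that would produce the cluster labels are not shown.</p>
      <preformat>
```python
def sse(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(axis) / len(members) for axis in zip(*members)]
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


def elbow(sse_by_k):
    """Return the K where the SSE decrease slows down the most.

    For three consecutive values of K the bend is
    (SSE[prev] - SSE[k]) - (SSE[k] - SSE[next]); the K with the largest bend wins.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        bend = (sse_by_k[prev] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[nxt])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# SSE of a toy clustering: two points in cluster 0, one point in cluster 1.
toy_sse = sse([(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)], [0, 0, 1])

# Hypothetical SSE curve, one value per tested K.
curve = {2: 600.0, 3: 300.0, 4: 60.0, 5: 50.0, 6: 45.0}
best_k = elbow(curve)
```
      </preformat>
      <p>On real data the curve is rarely this sharp, so the value returned by such a heuristic is best checked against a plot of the SSE.</p>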
      <p>To analyze the impact of traditional rule quality measures
(i.e., support, confidence and lift) on the cardinality of the
mined rule set, we performed many experiments by
varying minsup, minconf, and minlift. We recommend that users
set low support and confidence thresholds (e.g., 1%
and 1%, respectively) to avoid pruning interesting rules
with low confidence but a high lift value. We also
recommend a minimum lift threshold of 1.1 to prune both
negatively correlated and uncorrelated item combinations.</p>
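      <p>The effect of these thresholds can be illustrated with the standard definitions of support, confidence and lift for a rule A → B. The sketch below is not the FARTEC code: the function names and the record counts are hypothetical, and the default thresholds are the values recommended above.</p>
      <preformat>
```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence and lift of a rule A -> B from record counts."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    # lift = confidence / P(B), rewritten to use a single division.
    lift = (n_both * n_total) / (n_antecedent * n_consequent)
    return support, confidence, lift


def keep_rule(metrics, minsup=0.01, minconf=0.01, minlift=1.1):
    """Apply the three thresholds; lift close to or below 1 is pruned."""
    support, confidence, lift = metrics
    return support >= minsup and confidence >= minconf and lift >= minlift


# Low confidence (20%) but strong positive correlation (lift 4): kept.
rare_but_interesting = rule_metrics(1000, 100, 50, 20)
# Higher confidence (50%) but uncorrelated (lift 1): pruned.
frequent_but_uncorrelated = rule_metrics(1000, 100, 500, 50)

kept = keep_rule(rare_but_interesting)
pruned = not keep_rule(frequent_but_uncorrelated)
```
      </preformat>
      <p>The first rule would be lost under a high confidence threshold, which is why low support and confidence thresholds are paired with a lift threshold slightly above 1.</p>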
    </sec>
    <sec id="sec-13">
      <title>5. CONCLUSIONS AND FUTURE WORKS</title>
      <p>In this paper we presented FARTEC, a data mining
engine to analyze energy-related data through exploratory data
mining algorithms. Preliminary results on a real dataset
demonstrate the potential of the proposed methodology. We
are currently extending the FARTEC system with a
social platform where users are proactively engaged both in
pursuing energy-saving behaviours and in generating
data. Users could be engaged through rewards, the promotion of
virtuous behaviours shared with social peers, and
gaming approaches (e.g., a shared ranking of energy
ratings among neighbours). Engaged users could also provide
contextual information useful to optimize building energy
consumption.</p>
      <p>[Table: correlation coefficients among the weather attributes External Temperature, Mean Temperature, Precipitation, Wind Direction, Solar Radiation, UV Index, Humidity, and Pressure.]</p>
      <p>[Table 6: rule heads (consumption levels) of rules R14 to R18 for buildings bi and bj.]</p>
    </sec>
  </body>
  <back>
  </back>
</article>