<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SiMBA: Systematic Clustering-Based Methodology to Support Built Environment Analysis</article-title>
        <subtitle>(Discussion Paper)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carlo Andrea Biraghi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emilia Lenzi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maristella Matera</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo Tadi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Letizia Tanca</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>The general interest in sustainable development models has grown enormously over the last 50 years, and architecture and urban planning are certainly two areas in which research on the topic is most advanced. At the same time, the contribution of computer science to a systematic analysis of the territory, both from a morphological point of view and as regards performance, seems to have been underestimated in today's research. In this context, our research aims to join the two, until now separate, worlds of computer science and of architecture and urban planning. In particular, in this work we present SIMBA: Systematic clusterIng-based Methodology to support Built environment Analysis. SIMBA aims to enhance a consolidated analysis methodology, the Integrated Modification Methodology (IMM), through the integration of advanced analysis methods for the extraction of relevant patterns from built environment data. Using the city of Milan as a case study, we demonstrate the possibility for SIMBA to be generalised to the analysis of any built environment.</p>
      </abstract>
      <kwd-group>
        <kwd>Clustering</kwd>
        <kwd>Built environment</kwd>
        <kwd>Methodology</kwd>
        <kwd>Data mining</kwd>
        <kwd>Multidisciplinary</kwd>
        <kwd>Sustainability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        One of the biggest problems of our century is the impact of human behaviour on the
environment. For a given efficiency in resource management, this impact is greater in more
densely populated areas. According to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
approximately 80% of global primary energy is currently consumed in urban areas, and
cities are responsible for emitting more than 70% of the world’s total greenhouse gases and
consuming 60% of disposable water. Nonetheless, cities are the economic engine of the world.
This is why the problem of sustainability in built environments, defined as every human-made
space, has been broadly addressed in the literature. However, many issues are still open, and
the different approaches have deficiencies in several respects, e.g., the definition of the scale at
which the analysis is conducted, the introduction of the context in the analysis, and the use of
huge amounts of data and features to describe the built environment.
      </p>
      <p>
        In this paper we present SIMBA, a systematic clustering-based methodology to support built
environment analysis. SIMBA has been conceived as a framework extending and enhancing
IMM (Integrated Modification Methodology) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], an innovative design methodology for the
evaluation and improvement of the environmental performance of a city (or part of it). As
described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], IMM has already proved its effectiveness in the field of sustainability evaluation;
however, many challenges are still open, and SIMBA proposes to support this methodology in
the phase of built-environment analysis. Our case study is the city of Milan which, with its
88 NILs (Nuclei di Identità Locale, local identity units), has already been the subject of many
applications of IMM. The aims of the conducted experiments were (i) to enrich the phase of built
environment analysis through the use of data clustering techniques, and (ii) to systematise the
basic steps of the analysis process by capitalizing on the support gained from data analysis. In
particular, SIMBA aims to: (i) produce a methodology to select a reasonable, yet representative,
number of features when investigating the built environment; (ii) find experimental evidence of
corresponding patterns between the structure of the city and its performance; (iii) produce a
systematic method to measure the distance between elements, needed when comparing different
built units; (iv) promote a human-in-the-loop paradigm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] where domain experts can intervene
and refine the analysis process. With these objectives in mind, we chose to base the
methodology on clustering techniques, which support the identification of patterns in the
built environment data, which are unlabeled, and highlight which of the input elements are
more similar to each other. At the same time, such algorithms have been useful to single out
similarities and differences between the elements, as well as the conceptual distance
between them.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. IMM Methodology and data retrieved</title>
      <p>
        The IMM methodology aims at improving the environmental performance of the city by
modifying its structural characteristics. The main elements of the methodology related to the structures
are Attributes (e.g., the height of a building) and Metrics (e.g., Building Density (BD) equal to
the total number of buildings in an area divided by the total sample area). Attributes represent
immediately measurable characteristics while Metrics result from calculations. Moreover, from
Metrics, IMM defines the Key Categories, used to represent the products of the synergy between
elementary parts of the city. At the moment, the IMM group has defined 7 Key Categories, but
for the sake of simplicity in our experiments, we will consider only Permeability (the spatial
relationship between urban built-ups and voids) and Porosity (the relationship between the
street network and the spatial components that influence the overall connectivity). The Key
Categories are used to express morphological characteristics, while the tools for performance
evaluation are called Indicators (e.g., Public Transportation Stop Density) organized into Design
Ordering Principle (DOP) families, i.e., families of actions that designers can perform to improve
the current system behaviour [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. According to these definitions, we analyze the data related to
(i) Indicators, (ii) Metrics, and (iii) Attributes, and we also perform the same experiment
directly on raw data from the Municipality of Milan, which mostly concern
air pollution, building characteristics, population, transport, and services for each NIL. The
structure of each dataset is summarised in Table 1.
      </p>
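      <p>The distinction between Attributes and Metrics can be illustrated with a minimal sketch of the Building Density computation described above. The class name, field names, units, and numeric values are illustrative assumptions and do not come from the IMM datasets.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class SampleArea:
    """Attributes of one analysis unit (field names are illustrative)."""
    name: str
    n_buildings: int   # Attribute: directly measurable count
    area_km2: float    # Attribute: directly measurable surface


def building_density(unit: SampleArea) -> float:
    """Metric: number of buildings divided by the sample area."""
    if unit.area_km2 <= 0:
        raise ValueError("area must be positive")
    return unit.n_buildings / unit.area_km2


# Made-up values, used only to show the calculation.
units = [
    SampleArea("unit_a", n_buildings=1200, area_km2=2.0),
    SampleArea("unit_b", n_buildings=300, area_km2=1.5),
]
densities = {u.name: building_density(u) for u in units}
```
]]></preformat>
      <p>Here the two fields play the role of Attributes, while the returned value is a Metric, since it results from a calculation on them.</p>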
      <p>[Figure 1: SIMBA methodological workflow. Recoverable labels: 2. First-Level Clustering (one for each dataset), including 2.3 Clustering and Clustering Evaluation; 3. Second-Level Clustering, including 3.1 Important Dimension Choice and 3.2 Clustering on Selected Combination.]</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology and results</title>
      <p>In this section we present the SIMBA methodology together with the experiments we carried
out. Fig. 1 represents the methodological workflow. The process takes as input a
built environment of any kind (a town, a city, or even a country), as long as it is possible to
identify comparable units in it. The case study discussed
in this paper focuses on Milan. The methodology is composed of three main phases: (i) Built
Environment Decomposition (BED phase); (ii) First-Level Clustering (FLC phase, one for each
dataset); (iii) Second-Level Clustering (SLC phase). The following sections describe each
phase together with its application to our case study.</p>
      <sec id="sec-3-1">
        <title>3.1. BED phase setting</title>
        <p>The Built Environment Decomposition phase makes the analysis manageable through
Data Mining algorithms, since it reduces the number of features in each single analysis
by splitting all the available data into distinct datasets according to different Dimensions.</p>
        <p>In this phase, we first need to define the granularity at which the analysis has to be conducted
(step 1.1). In other words, we need to choose the samples we want to cluster;
in our case study these samples are the 88 NILs. Although the definition of NIL is specific to
the city of Milan, this does not affect the generalization potential of the methodology, since
similar divisions (possibly at different scales) can be found in other cities, and therefore step
1.1 can be carried out. After the granularity, we need to define the Dimensions (step 1.2) and
retrieve the data according to them (step 1.3). The Dimensions represent the different aspects
we want to analyse. They can be of any type and category, e.g., environmental performance,
morphological characteristics, demographic data and so on, but they remain abstract concepts
for us: they work as simple guidelines to be used while looking for data to retrieve. In
fact, it is in the datasets that the Dimensions are represented. As far as our experiments are
concerned, we chose to focus on all the Dimensions at our disposal, as our four datasets include
both performance data and structural characteristics. Consequently, the datasets we create for
step 1.3 correspond to the ones we already defined in Section 2. Note that in some experiments
we considered all Metrics together, in others Permeability and Porosity separately.</p>
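        <p>The decomposition described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the feature names, the mapping from features to Dimensions, the unit identifiers, and the values are all hypothetical, and only the four dataset names follow Section 2.</p>
        <preformat><![CDATA[
```python
# Hypothetical mapping from a feature name to the Dimension it describes.
FEATURE_DIMENSION = {
    "pt_stop_density": "Indicators",
    "building_density": "Metrics",
    "building_height": "Attributes",
    "no2_level": "Raw data",
}


def decompose(samples, feature_dimension):
    """Split one wide table into one dataset per Dimension.

    samples maps a unit id (e.g., a NIL) to a dict of feature values.
    Returns {dimension: {unit_id: {feature: value}}}.
    """
    datasets = {}
    for unit_id, features in samples.items():
        for feature, value in features.items():
            dimension = feature_dimension[feature]
            per_unit = datasets.setdefault(dimension, {})
            per_unit.setdefault(unit_id, {})[feature] = value
    return datasets


# Made-up values for two units at the chosen granularity (step 1.1).
samples = {
    "NIL_01": {"pt_stop_density": 4.2, "building_density": 610.0,
               "building_height": 18.5, "no2_level": 41.0},
    "NIL_02": {"pt_stop_density": 1.1, "building_density": 220.0,
               "building_height": 9.0, "no2_level": 28.0},
}
datasets = decompose(samples, FEATURE_DIMENSION)
```
]]></preformat>
        <p>Each resulting dataset keeps the same units but only the features of one Dimension, which is what keeps the number of features small in each single analysis.</p>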
      </sec>
      <sec id="sec-3-2">
        <title>3.2. FLC phase setting</title>
        <p>Once we have our datasets ready, we perform First-Level Clustering for each of them. This
phase is needed to: (i) prepare the datasets for clustering; (ii) select the important features; (iii)
perform clustering; (iv) evaluate each obtained cluster. The outputs of this phase are: (i) different
clusterings for each dataset; (ii) distances between the NILs for each dataset.</p>
        <p>
          These outputs are useful to investigate patterns in each dataset, and therefore in each Dimension. As
for the setting of this phase, the final choice of the preprocessing techniques, of the feature
selection algorithm, and of the clustering algorithm itself was
dictated by the nature of the data at our disposal. It is important to
underline, however, that once the process has been formalised, the procedure remains unchanged
even if the specific setting depends on the input data. In the preprocessing
step (2.1), we mostly had to manage missing values and perform some feature engineering, and we
collaborated closely with the experts to preserve the data semantics as much
as possible. As far as step 2.2 is concerned, for the manual feature selection we relied entirely on
the experts, asking them to select no more than 8 features for each dataset; for the automated
case, instead, the selection was conducted using the entropy-based algorithm presented in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
For clustering (step 2.3), we used the Agglomerative Hierarchical Clustering algorithm [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The
choice was dictated firstly by the few samples at our disposal (only 88 NILs), and secondly by
the need for a criterion to select the number of clusters in each experiment. For this
parameter, in fact, no indications were given to us by the experts, as this type of experiment is
new to them and is mainly exploratory in nature. For this reason, to determine the value of the
parameter n_clusters we relied on the dendrograms and the Knee-Elbow graph we obtained
for each dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
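        <p>To make the clustering step more concrete, the following is a minimal pure-Python sketch of agglomerative hierarchical clustering. It is not the implementation used in the experiments: the average-linkage criterion, the Euclidean distance, and the "largest gap in merge distance" rule (a simplified stand-in for reading a dendrogram or a Knee-Elbow graph) are assumptions made for illustration, and the points are made up.</p>
        <preformat><![CDATA[
```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, c1, c2):
    """Mean pairwise distance between two clusters of point indices."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Returns one (merge_distance, clusters_after_merge) pair per merge.
    """
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        rest = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters = rest + [merged]
        history.append((d, [sorted(c) for c in clusters]))
    return history


def cut_at_largest_gap(history):
    """Keep the clustering just before the largest jump in merge distance."""
    distances = [d for d, _ in history]
    gaps = [distances[k + 1] - distances[k] for k in range(len(distances) - 1)]
    return history[gaps.index(max(gaps))][1]


# Made-up 2D points: two compact groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
clusters = cut_at_largest_gap(agglomerate(points))
n_clusters = len(clusters)
```
]]></preformat>
        <p>In this toy example the isolated point ends up in a cluster of its own, so it raises n_clusters by one, which mirrors how outliers are counted in the experiments.</p>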
        <p>
          The last step of this phase is dedicated to the evaluation of the clusters, which is still a tricky
part in this field and most of the time is strongly application-dependent. Firstly, we tried to
obtain an absolute evaluation of each clustering result using internal metrics such as the Dunn and
Davies-Bouldin [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] indexes, but the problem with these metrics is that, with a small number
of samples, even a few distant samples in a cluster produce a decrease in the score. For this
reason, we decided to evaluate clustering results mostly by comparing
the results obtained with the manual and the automated choice of the features. In this way, we
use the results obtained from the manual feature selection as a proxy for ground truth, even if
they derive from a semiautomatic process.
        </p>
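        <p>The sensitivity of internal metrics to a few distant samples can be seen in a small sketch. It uses one common variant of the Dunn index (smallest distance between points of different clusters divided by the largest cluster diameter); the choice of this variant, the Euclidean distance, and the points are assumptions made for illustration only.</p>
        <preformat><![CDATA[
```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dunn_index(clusters):
    """Smallest between-cluster distance over largest within-cluster diameter.

    Assumes at least two clusters and at least one cluster with two points.
    """
    min_separation = min(
        euclidean(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p in c1
        for q in c2
    )
    max_diameter = max(
        euclidean(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter


# Made-up clusters: the second case adds one distant sample to a cluster.
compact = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
with_outlier = [[(0.0, 0.0), (0.0, 1.0), (0.0, 5.0)],
                [(10.0, 0.0), (10.0, 1.0)]]

dunn_compact = dunn_index(compact)
dunn_outlier = dunn_index(with_outlier)
```
]]></preformat>
        <p>A single distant sample stretches the diameter of its cluster and lowers the index considerably, even though the separation between the two clusters is unchanged.</p>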
        <p>To do so, we first defined the comparison_matrix, a matrix used to store the number of
common elements among the clusters obtained by selecting different sets of features. For each
experiment (dataset) we computed the comparison_matrix between the two approaches (the
manual and the automated one). The pseudocode for the computation is shown in Algorithm 1
below.</p>
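        <p>Algorithm 1: Comparison_matrix computation
<preformat>Input: labels (an N × 2 array; labels[i][m] is the cluster assigned to NIL i by method m)
Output: CM
N = labels.shape[0]
for i = 0, i &lt; N, i++ do
    for j = 0, j &lt; N, j++ do
        CM[i][j] = (labels[i] == labels[j]).sum()
return CM</preformat></p>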
        <p>In Algorithm 1, the comparison_matrix CM is a matrix of dimension N × N, where N is the number
of NILs for the dataset. For two NILs i and j, CM[i][j] = 2 if both methods put the NILs i and
j in the same cluster, CM[i][j] = 1 if only one method puts the two NILs in the same cluster,
and CM[i][j] = 0 otherwise. This means that the positive cases correspond to the values 0 and
2, since both indicate that the two approaches produced the same result for
that pair of NILs.</p>
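        <p>A minimal Python sketch of this computation is given below. The function and variable names are illustrative assumptions, as are the two label lists; the logic follows the definition of CM given above, counting for each pair of units how many methods place them in the same cluster.</p>
        <preformat><![CDATA[
```python
def comparison_matrix(labelings):
    """Count, for every pair of units, how many labelings co-cluster them.

    labelings is a list of label lists, one per method, all of length N.
    Labels are compared within each method only, so cluster numbering
    does not need to match across methods.
    """
    n = len(labelings[0])
    cm = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            cm[i][j] = sum(labels[i] == labels[j] for labels in labelings)
    return cm


manual = [0, 0, 1, 1]      # hypothetical labels with manual features
automated = [0, 0, 0, 1]   # hypothetical labels with automated features
cm = comparison_matrix([manual, automated])
```
]]></preformat>
        <p>With two methods, each entry is 0, 1, or 2, and the diagonal is always 2, since every unit shares a cluster with itself in both clusterings.</p>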
        <p>Finally, to evaluate each experiment, we defined the measure score, whose computation is
shown in Algorithm 2.</p>
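        <p>Algorithm 2: Score computation
<preformat>Input: CM, n_methods, n_NILs
Output: score
score = 0
for i = 0, i &lt; n_NILs, i++ do
    for j = 0, j &lt; n_NILs, j++ do
        if CM[i][j] == 0 or CM[i][j] == n_methods then
            score = score + 1
return score = score / n_NILs</preformat></p>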
        <p>In a given experiment, for each pair of NILs, score counts how many values 0 or 2 occur
in the corresponding comparison_matrix, and normalizes this number w.r.t. the number of
NILs present in that experiment (88 or 86). Assuming that a good result for our experiment
is that the two clustering runs group all the NILs in the same way, in the Manual and in the
Automated case, score can be seen as an accuracy measure for the procedure. In addition, this
measure provides a direct and simple comparison between Manual and Automated cases.</p>
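        <p>The score computation can be sketched in Python as follows. The function name, the parameter n_methods, and the example matrices are illustrative assumptions; the matrix entries are read as in the definition of the comparison_matrix given earlier in this section.</p>
        <preformat><![CDATA[
```python
def score(cm, n_methods):
    """Count entries where all methods agree, divided by the number of units.

    An entry agrees when it is 0 (no method co-clusters the pair) or
    n_methods (every method co-clusters the pair).
    """
    n = len(cm)
    agreeing = sum(
        1
        for i in range(n)
        for j in range(n)
        if cm[i][j] == 0 or cm[i][j] == n_methods
    )
    return agreeing / n


# Hypothetical comparison matrix for four units and two methods.
cm = [
    [2, 2, 1, 0],
    [2, 2, 1, 0],
    [1, 1, 2, 1],
    [0, 0, 1, 2],
]
result = score(cm, n_methods=2)

# Full agreement on two units: every entry counts, so the score equals N.
perfect = score([[2, 2], [2, 2]], n_methods=2)
```
]]></preformat>
        <p>Under this normalisation the maximum attainable value is the number of units N, which is why a good result is one whose score is close to the number of NILs.</p>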
        <p>Therefore, the score provides a numerical evaluation of the clusters obtained, and allows
the procedure to be repeated in any proposed scenario. However, given the nature of the
experiments and the absence of any real ground truth, a qualitative, although less formal,
evaluation by the experts cannot be ignored. For this reason, in Figure 2 we report the most
relevant results directly on the map of Milan. Each map is divided into NILs coloured according
to the cluster they belong to. The colours have also been chosen to trace which clusters are
closest to each other; the outliers are always coloured red, so in the figure they appear as
a single cluster, but each of them contributes individually to the n_clusters parameter which,
as we note, varies among the different experiments. In addition, black is used to highlight the
NILs that are missing in some of the datasets. In Table 2 we also report the scores obtained for
all the datasets.</p>
        <p>[Figure 2: Clustering results on the map of Milan. Panels: (a) Indicators Manual, n_clusters = 3; (b) Indicators Automated, n_clusters = 4; (c) Metrics Manual, n_clusters = 4; (d) Metrics Automated, n_clusters = 5; (e) Porosity Manual, n_clusters = 2; (f) Porosity Automated, n_clusters = 2; (g) Permeability all, n_clusters = 4.]</p>
        <p>Figures 2a and 2b represent the clusters obtained by applying the Manual and the Automated
algorithms to the Indicators dataset. What emerges from these maps is that the two algorithms
provide almost the same results. One might think that indicators only highlight
macro-differences among NILs, but the results have been judged by the experts to be perfectly coherent
with the way DOP families describe NILs. Moreover, as shown in Table 2, the score is rather close to
the number of NILs (88).</p>
        <p>Looking at the results related to the Metrics dataset (Figures 2c and 2d) we note that, in this
case, the two algorithms (Manual and Automated) perform very differently. What emerges is
that the features selected manually can express finer differences, while the automated algorithm
only separates the central NILs from the more peripheral ones, with some exceptions in the very
peripheral areas. This is probably because in the Manual case the experts selected both
Porosity and Permeability metrics to balance their effects, while the entropy-based algorithm
mostly selected Porosity-related ones. The likely cause is the imbalance between the
number of metrics related to Porosity (52) and those related to Permeability (7), which affected the
automated selection of the features.</p>
        <p>This is confirmed by the remaining maps in Figure 2, where we note that using
only Porosity-related metrics we have a sharp separation between centre and periphery NILs,
while, using the ones related to Permeability, it is possible to define some clusters also in the
peripheral areas. This gives us an idea of how SIMBA can be useful to study the effect of the
different Key Categories separately or jointly, which is of great importance for the IMM group.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. SLC phase setting</title>
        <p>Second-Level Clustering is the third and last phase of SIMBA. It takes two inputs: (i) the
First-Level Clustering results, to identify on which dataset the clustering performed better; (ii)
the IMM experts’ indications about the Dimensions to focus on to continue the analysis (step
3.1). It produces two outputs: (i) the clustering results on the dataset created by combining the
important Dimensions; (ii) a formalization of distances between NILs considering only selected
features of the important Dimensions.</p>
        <p>In the previous sections we showed the results obtained by clustering NILs keeping the
datasets separated. This was an important step, since it allowed us to understand which are the
important IMM elements to focus on, both in an automated way and in a more guided one.
As we have seen, both from the score values and from the produced maps, the Indicators
dataset was the one where the procedure performed best. This is probably because
the indicators themselves are well separated into DOP families and the entropy-based
feature-selection algorithm can extract at least one indicator for each family. Other interesting results,
as we have shown, are obtained on the Metrics dataset. This time the score is considerably
worse, but the meaning of the resulting clusters in terms of morphology cannot be ignored. This
is why we decided to add this last phase to the methodology, applying clustering to a
dataset composed of both indicators and metrics (step 3.2). The results are reported in Figure 3.</p>
        <p>By comparing the maps, three important things emerge: (i) most of the NILs of the city centre are
always in the same cluster; (ii) the manual case shows practically the
same results as the Indicators case; (iii) the automated case also contains some of the clusters
identified in the Metrics automated case. This might mean that some indicators
guide the whole creation of clusters, but some of the metrics selected in the automated
case end up smoothing their effect.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and future works</title>
      <p>In this work, we presented SIMBA, a systematic clustering-based methodology to support
built-environment analysis, and we showed its potential as a support tool for architects and
urban planners in understanding and analysing urban environments. The biggest limitation of
the work is the nature of the data. On the one hand, the scarcity of the samples (only 88 NILs)
makes the analysis sensitive to outliers and difficult to formally generalise; on the other hand,
the excessive number of characteristics, which are often redundant and not very informative,
affects the quality of the results produced. Two aspects on which future work will therefore
focus are the definition of a finer granularity and the selection of features. Additional efforts
will be devoted to refining the human-in-the-loop approach that SIMBA wants to support, by
means of explanations that can guide the domain experts (and eventually, different experts)
in the selection of relevant analysis Dimensions in the SLC phase. In addition, to assess the
actual possibility for SIMBA to be generalised to the analysis of any built environment, different
datasets with different characteristics should be analyzed.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Acknowledgments</title>
      <p>We would like to thank Dott. Hadi Mohammad Zadeh who has contributed to this work and its
future development.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>World Bank</collab>
          ,
          <article-title>Cities and climate change: an urgent agenda</article-title>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H. M.</given-names>
            <surname>Zadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ogut</surname>
          </string-name>
          ,
          <article-title>Measuring the influence of functional proximity on environmental urban performance via integrated modification methodology: Four study cases in Milan</article-title>
          ,
          <source>International Journal of Urban and Civil Engineering</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monreale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruggieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Turini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <article-title>A survey of methods for explaining black box models</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          <volume>51</volume>
          (
          <year>2019</year>
          )
          <fpage>93:1</fpage>
          -
          <lpage>93:42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Feature selection for clustering</article-title>
          , ACM Digital Library (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wunsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <source>Clustering</source>
          , IEEE Press Series on Computational Intelligence, Wiley,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Meira Jr.</surname>
          </string-name>
          ,
          <source>Data Mining and Machine Learning: Fundamental Concepts and Algorithms</source>
          , Cambridge University Press,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>