<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Italian Conference on Big Data and Data Science</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Anomaly Detection for Physical Threat Intelligence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Mignone</string-name>
          <email>paolo.mignone@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donato Malerba</string-name>
          <email>donato.malerba@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michelangelo Ceci</string-name>
          <email>michelangelo.ceci@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Big Data Lab, National Interuniversity Consortium for Informatics (CINI)</institution>
          ,
          <addr-line>Via Ariosto, 25, 00185, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science</institution>
          ,
          <addr-line>Via Orabona, 4, 70125</addr-line>
          ,
          <institution>University of Bari Aldo Moro</institution>
          ,
          <addr-line>Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>2</volume>
      <fpage>0</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>Anomaly detection is a machine learning task that has been investigated within diverse research areas and application domains. In this paper, we performed anomaly detection for Physical Threat Intelligence. Specifically, we performed anomaly detection for air pollution and public transport traffic analysis for the city of Oslo, Norway. To this aim, the state-of-the-art method SparkGHSOM was considered to learn predictive models for normal (i.e. regular) scenarios of air quality and traffic jams in a distributed fashion. Furthermore, we extended the main algorithm to make the detected anomalies explainable through an instance-based feature ranking approach. The results showed that SparkGHSOM is able to detect anomalies in both the real applications considered in this study, despite the fact that it was designed for different tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Anomaly detection</kwd>
        <kwd>Air pollution</kwd>
        <kwd>Public transport traffic</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Anomaly detection is a machine learning task that refers to the problem of identifying data
that do not conform to patterns observed in historical data. These patterns represent the
expected behaviour in normal conditions. Therefore, anomaly detection is usually performed
through a data-driven algorithm to construct a model which will be able to detect a specific
measurement/object/instance/observation as anomalous with respect to the historical data
already seen. Anomaly detection is a very general task that finds applications in many
real-world scenarios such as fraud detection for credit cards, insurance, or health care, intrusion
detection for cyber-security, fault detection in safety-critical systems, and military surveillance
for enemy activities [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>In this paper, we consider the Anomaly Detection task for the purposes of Physical Threat
Intelligence. Specifically, we propose an algorithm for anomaly detection which works on data
continuously collected by geo-located sensors located in urban areas. The data refer to physical
information (e.g. temperature, number of vehicles crossing a gate, number of pedestrians in a
given area, PM10 level at certain points in the town, etc.). The goal is to identify an anomalous,
not expected, behaviour for one or many values simultaneously, considering the specific time,
date and spatial coordinates of the considered observation. This would give the opportunity to
Security Operators to understand potentially dangerous situations and take the appropriate
actions in time.</p>
      <p>
        The task we consider here is particularly challenging since data generated by sensors are large
in volume and have spatial and temporal coordinates that make the observations not independent. Indeed,
the spatial proximity of sensors introduces spatial autocorrelation in functional annotations and
violates the usual assumption that observations are independently and identically distributed
(i.i.d.). Although the explicit consideration of these spatial dependencies brings additional
complexity to the learning process, it generally leads to increased accuracy of learned models
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In addition, data generated by sensors are also affected by temporal autocorrelation, since
they:
i) tend to have similar values at the same time on close days;
ii) have a cyclic and seasonal (over days and years) behavior;
iii) tend to show the same trend over time.
      </p>
      <p>
        While stream mining algorithms deal with both i) and ii), they may fail to consider iii), since
they tend to better represent the most recently observed concepts, forgetting previously learned
ones [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. On the contrary, time series-based approaches are able to deal with iii), but may fail to
consider i) and ii). In fact, they typically require the size of the temporal horizon as an input:
Considering a short-term horizon (e.g., daily) excludes a long-term horizon (e.g., seasonal) and
vice versa. In contrast, in the approach presented in this paper, we propose a time-series
approach that exploits both spatial and temporal features, in order to take into account all
the aspects mentioned before. In particular, the method addresses the problem of identifying
complex spatio-temporal patterns in sensor data by means of Self-Organizing Maps (SOMs).
      </p>
      <p>
        A SOM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a neural-network-based clustering algorithm that operates by mapping
high-dimensional input data into a 2-dimensional space implemented by a grid of neurons called
a feature map. In this paper, we consider GHSOMs (Growing Hierarchical SOMs), which are
particularly suitable for time series data and better capture spatio-temporal information thanks
to the hierarchical organization of the SOMs that better adapt to complex data distribution.
Specifically, we consider the distributed extension Spark-GHSOM [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], that exploits the Spark
architecture to process massive data, like those coming from sensors. Since GHSOMs are
designed for clustering and not for anomaly detection tasks, we extend the learning algorithm
Spark-GHSOM in order to learn GHSOMs for anomaly detection, in an unsupervised fashion.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Spark-GHSOM</title>
      <p>
        Spark-GHSOM [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was introduced to overcome two limitations of the classical GHSOMs. Indeed,
a GHSOM i) requires multiple iterations over the input dataset making it intractable on large
datasets; ii) it is designed to handle datasets with numeric attributes only, representing an
important limitation as most modern real-world datasets are characterized by mixed attributes
(numerical and categorical). Therefore, Spark-GHSOM exploits the Spark platform to process
massive datasets in a distributed fashion. Furthermore, it exploits the distance hierarchy [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
to modify the optimization function of GHSOM so that it can (also) coherently handle
mixed-attribute datasets. Spark-GHSOM showed high accuracy, scalability, and descriptive power on
different datasets.
      </p>
      <p>
        The first step in the GHSOM algorithm is to compute the inherent dissimilarity in the
input data with diferent types of attributes. Classical GHSOMs exploit the mean quantization
error. However, this error is suitable for numerical attributes only. Since there is no standard
definition of mean for categorical attributes, SparkGHSOM replaces the mean quantization error
with the variance in order to assess the quality of the map and the neurons. For
categorical attributes, unlikability is a good measure to estimate how often the values differ from
one another [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Formally, let D be the dataset under analysis. The unlikability for a categorical
attribute A of D is defined as:

u(A) = ∑_{v_i ∈ dom(A)} p_i (1 − p_i)   (1)
      </p>
      <p>where p_i = f(v_i, D) / |D|, v_i is the i-th value of the attribute A, and f(v_i, D) is the
absolute frequency of the value v_i for the attribute A in D. Therefore, SparkGHSOM computes
the overall variance of the dataset as follows:

V(D) = ∑_{A ∈ Attr(D)} ( 1_num(A) · var(A) + 1_cat(A) · u(A) )   (2)</p>
      <p>where 1_num(A) (resp. 1_cat(A)) is 1 when the attribute A is numerical (resp. categorical) and 0
otherwise, and var(A) represents the classical variance for the attribute A when it is numerical.</p>
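      <p>To make equations (1) and (2) concrete, the following minimal Python sketch computes the unlikability of a categorical column and the mixed overall variance. The column names and the type check are illustrative assumptions, not part of the actual Spark-GHSOM implementation, which runs distributed on Spark:</p>
      <p>
```python
from collections import Counter

def unlikability(values):
    # Equation (1): sum over the distinct values v of p_v * (1 - p_v),
    # where p_v is the relative frequency of v in the column.
    n = len(values)
    return sum((c / n) * (1 - c / n) for c in Counter(values).values())

def variance(values):
    # Classical (population) variance for a numerical column.
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def overall_variance(columns):
    # Equation (2): numerical columns contribute their variance,
    # categorical columns their unlikability.
    total = 0.0
    for values in columns.values():
        if all(isinstance(x, (int, float)) for x in values):
            total += variance(values)
        else:
            total += unlikability(values)
    return total

# Hypothetical columns: one numerical pollutant, one categorical station id.
data = {"pm10": [10.0, 12.0, 11.0, 40.0],
        "station": ["Hjortnes", "Hjortnes", "Spikersuppa", "Spikersuppa"]}
print(overall_variance(data))
```
</p>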
      <p>
        The distance hierarchy [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is considered to compute the similarities among the categorical
values. To compute the distance among categorical values, a distance hierarchy for each
categorical attribute must be provided in advance. Similar values according to the concept
hierarchy are placed under a common parent which represents an abstract concept. The GHSOM
training process takes into account mixed attributes and consists in finding the winner (closest)
neuron of the SOM w.r.t. the single input instance according to the distance hierarchy.
      </p>
      <p>In the first step, the winner neuron is identified for the input instance according to the
distance hierarchy. Then, the neuron’s weight vector is modified by a certain amount to
match the instance vector. In the hierarchy tree of the concepts, where the leaves represent the
actual values of the instances and the non-leaf nodes represent the neurons, this process pulls
the neuron point towards its leaf in order to “specialize” what the neuron describes.</p>
      <p>In the second step, the winner neuron and its surrounding neighbor neurons in the
SOM are adapted by moving them towards the input instance. This training process requires a
defined number of training epochs over the input dataset. The training is governed by the Mean
Quantization Error (MQE) of a neuron, that is, the total deviation of the neuron from its mapped
input instances. The MQE for a SOM layer is computed as the average MQE of all the neurons
representing instances. A higher value of the MQE means that the layer does not represent the
input data well and requires more neurons to better represent the input domain. Moreover,
when a single neuron still does not represent its surrounding instances well, the neuron is
expanded hierarchically into a new SOM (see Figure 1).</p>
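      <p>The quantization errors that govern training and growth can be sketched as follows. This is a simplified, single-machine illustration: the names neuron_mqe, layer_mqe, should_grow and the growth factor tau are our assumptions, not identifiers from the Spark-GHSOM code:</p>
      <p>
```python
def neuron_mqe(weight, instances):
    # Mean quantization error of one neuron: average Euclidean
    # distance between its weight vector and its mapped instances.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sum(dist(weight, inst) for inst in instances) / len(instances)

def layer_mqe(mapping):
    # MQE of a SOM layer: average MQE over the neurons that
    # actually represent at least one instance.
    errs = [neuron_mqe(w, insts) for w, insts in mapping if insts]
    return sum(errs) / len(errs)

def should_grow(mapping, tau, parent_mqe):
    # A layer keeps growing (adds neurons) while its MQE is still
    # above a fraction tau of the parent neuron's MQE.
    return layer_mqe(mapping) > tau * parent_mqe
```
</p>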
    </sec>
    <sec id="sec-3">
      <title>3. Spark-GHSOM for Anomaly Detection</title>
      <p>
        The training process of the Spark-GHSOM follows the classical process of the GHSOM training,
except for the use of a different function for the calculation of the distance between the input
vector and the neurons of the feature map, since the Euclidean distance is not computable on
categorical attributes. For this reason, the hierarchical distance was chosen [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ].
      </p>
      <p>The hierarchy obtained can thus be used to solve an anomaly detection task. In particular,
when a new input vector is supplied to the hierarchy, the algorithm looks for the SOM that
succeeds in better approximating the input data (that is, the SOM with the shortest distance
with respect to the input vector). Once found, it is used to carry out the prediction for the new
input data, based on the distance between the input vector and the closest neuron (the winner
neuron) in the map.</p>
      <p>More formally, let x be the new example to be considered, and let w(x) = arg min_n d(n, x)
be the closest neuron to x according to the distance measure described before. The example is
considered an anomaly if the following inequality holds:

d(x, w(x)) > μ + t · σ   (3)

In the formula, μ is the average distance between the training instances and the neurons of
the model after the training, σ is the standard deviation of such distances, and t is the user-defined
threshold.</p>
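      <p>A minimal sketch of decision rule (3), assuming the distances between the training instances and their winner neurons are already available (function names are illustrative):</p>
      <p>
```python
def fit_threshold(train_distances, t):
    # mu and sigma are the mean and standard deviation of the
    # training distances; t is the user-defined threshold factor.
    n = len(train_distances)
    mu = sum(train_distances) / n
    sigma = (sum((d - mu) ** 2 for d in train_distances) / n) ** 0.5
    return mu + t * sigma

def is_anomaly(distance_to_winner, threshold):
    # Rule (3): flag the example when its distance to the winner
    # neuron exceeds mu + t * sigma.
    return distance_to_winner > threshold

thr = fit_threshold([1.0, 1.2, 0.9, 1.1, 1.0], t=3.0)
print(is_anomaly(5.0, thr))   # a far-away example is flagged
```
</p>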
      <p>As data distributions tend to change over time, it may be necessary to update the knowledge
of the anomaly detector using more recent data. For this reason, Spark-GHSOM for anomaly
detection provides the possibility to update the weight vectors of the neurons while keeping the
generated hierarchy unchanged. This process can be particularly useful if end users do not have
enough time or data available to train a new anomaly detector from scratch. Consequently,
having a pre-trained model already available, it is possible to provide the model with a
micro-batch of data, in order to update the knowledge extracted by the model and adapt it to the
user’s needs. This aspect is particularly useful in our case, where the data generated by the sensors
can be relatively scarce.</p>
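      <p>The micro-batch update can be sketched as a plain SOM weight adaptation in which only the weight vectors move while the hierarchy stays frozen. The learning rate alpha and all names here are assumptions for illustration, not values from the paper:</p>
      <p>
```python
def update_weights(neurons, batch, alpha=0.1):
    # Move each winner neuron's weight vector a small step toward
    # the new instances; the map topology itself is left unchanged.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    for inst in batch:
        winner = min(range(len(neurons)), key=lambda i: dist2(neurons[i], inst))
        neurons[winner] = [w + alpha * (x - w) for w, x in zip(neurons[winner], inst)]
    return neurons
```
</p>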
      <p>The anomaly detector could produce different types of output depending on the level of detail.
The simplest approach provides feedback for the current data in the form of a Boolean response.
This kind of output could support raising an alert if the response is equal to “anomaly”.</p>
      <p>This approach has the advantage that it is simple to handle and transmits the prediction as
a binary variable (e.g., anomaly/normal, 0/1, true/false). Its drawback is that it makes it difficult
for the end-user to interpret the raised alert/anomaly. Therefore, a more informative approach
could be considered by combining the previous one with a ranking of the variables (feature
ranking) according to their importance, indicating each variable’s contribution to the detected
anomaly.</p>
      <p>Feature ranking is a ranking of the entire set of features composing the data collection, ordered
with respect to the feature importance. Feature importance is a numerical value between 0 and
1, which expresses how anomalous the value expressed by the feature is with respect to the
data collection, such that the sum of all the feature importance values in the feature ranking is
equal to 1. The importance score is determined starting from a distance function between the
current data under analysis and the winner neuron. Specifically, the ranking is proportional to
the contribution provided by the single feature to the Euclidean distance between x and w(x).
More formally, the ranking function for the instance x and feature index j, r_j(x), is computed as follows:

r_j(x) = (x[j] − w(x)[j])² / ∑_{j′} (x[j′] − w(x)[j′])²   (4)</p>
      <p>where j represents the feature index.</p>
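      <p>Equation (4) can be sketched in Python as follows (illustrative code, not the actual implementation; x and winner stand for the instance vector and the winner-neuron weight vector):</p>
      <p>
```python
def feature_ranking(x, winner):
    # r_j(x): each feature's squared deviation from the winner
    # neuron, normalized so that the importances sum to 1.
    sq = [(xi - wi) ** 2 for xi, wi in zip(x, winner)]
    total = sum(sq)
    if total == 0:
        return [0.0] * len(sq)
    return [s / total for s in sq]

# The feature with the largest importance is the main "reason"
# for the detected anomaly.
scores = feature_ranking([50.0, 1.0], [10.0, 1.0])
```
</p>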
      <p>This approach helps to identify the feature(s) that most contributed to the anomaly and,
therefore, the “reason” for the anomaly.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>Air pollution analysis</title>
        <p>The experiments were conducted for the city of Oslo (Norway) by considering two real domains
for the following analyses: air pollution and public transport traffic.</p>
        <p>The proposed method was tested using data coming from air quality monitoring sensors to
identify pollutant concentrations deemed abnormal. The considered locations within the city of
Oslo are shown in Figure 2.</p>
        <p>At each location, different pollutants are monitored by the sensors:
• Hjortnes: NO, NO2, NOx, PM10 and PM2.5
• Loallmenningen: NO, NO2, NOx, PM1, PM10 and PM2.5
• Spikersuppa: PM10 and PM2.5
The information on the concentration of pollutants comes with both a timestamp and the
geo-coordinates (latitude and longitude), so that the time series can be reconstructed. The data,
which are publicly available, can be downloaded through a REST API.</p>
        <p>The period considered for training was from January 2021 to September 2021, with an hourly
sampling rate, totalling 18,286 data points from the chosen locations. The period considered for
testing is October 2021, totalling 720 acquisitions from the chosen locations. The best value
for the parameter t was selected through internal cross-validation on the training
instances in the interval [0, 15].</p>
        <p>Figure 3 shows the concentrations per hour of NO, NOx, and NO2 pollutants during the
identified test period, i.e., October 2021, from the station of Hjortnes. These pollutants were
chosen because they appear in the top-3 of the feature ranking for the time instants
considered anomalous by the algorithm, indicated with black arrows in the graph.</p>
        <p>It is worth noting that we did not find an abnormal situation on October 21 at 10 a.m.,
indicated with a green arrow in Figure 4, when very high concentrations of PM1 were recorded,
even though at this time point the pollutant PM1 is correctly present in the first position of the
feature ranking.</p>
        <p>This is because several pollutants are observed together, and a sudden
increase in the concentration of one of them is sometimes not sufficient to classify the time instant
as a potentially abnormal situation.</p>
        <p>Figure 5 shows the concentration per hour of PM1 pollutant during the test period, from
Loallmenningen. For this place, PM1 is the most decisive pollutant for the detection of abnormal
situations that occurred during October 2021.</p>
        <p>As in the previous graphs, the black arrows indicate the time instants in which we detected
abnormal concentrations of the pollutants considered. As expected, the algorithm was able to
correctly detect high concentrations of the PM1 pollutant.</p>
        <p>On October 26 at 9 p.m., as indicated by the green arrow, the concentrations of PM1 were
very similar to those of October 27 at 4 p.m.; however, only in the latter case was an anomalous
situation found by the algorithm. A more detailed graph is shown in Figure 6.</p>
        <p>The reason is due to a sudden increase in concentrations of the remaining pollutants, which
occurred on October 27 at 4 p.m. This situation, as shown in Figure 7, allowed the algorithm
to identify an anomalous situation at this timestamp. Figure 8 shows the concentrations per
hour of PM10 and PM2.5 pollutants during the test period, from the area of Spikersuppa. The
pollutants shown in the graph are the only ones the station can monitor. As expected, the
algorithm did not identify any situations deemed abnormal for this place, as the concentrations
of October are quite regular.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Public transport trafic</title>
        <p>This dataset consists of one week of data regarding Oslo’s public transport. The instances represent
GPS-tracked buses with latitude and longitude. Each instance is timestamped according to the
ISO 8601 standard with a resolution in seconds. The Service Interface for Real-time Information -
Vehicle Monitoring (SIRI-VM) is used to model vehicle movements and their progress compared
to a planned timetable (https://api.entur.io/realtime/v1/rest/vm?datasetId=RUT).</p>
        <p>
          For this dataset, the processing pipeline illustrated in Figure 9 was executed. Starting
from the week of data on Oslo public transport traffic, we performed data cleaning in order
to fix some encoding issues. We also aggregated the data into 5-minute intervals and into
spatial areas obtained through a preliminary clustering step. This step was crucial since the data
provided refer to movable points on the map, making the aggregation operations unfeasible.
Clustering on the spatial locations was performed by exploiting the K-Means algorithm [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The
variables of the considered data were extended with the cluster identifier (cluster ID)
and the cluster’s centroid latitude and longitude. Since the K-Means algorithm needs
the number of clusters as input, we performed the well-known silhouette cluster analysis [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
with the aim of identifying the number of areas for monitoring the traffic. According to the silhouette
analysis, we considered 100 different regions for traffic monitoring (see Figure 10).
        </p>
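        <p>This cluster-number selection can be sketched with a plain-Python K-Means and silhouette computation (a plausible reimplementation for illustration; the paper does not report the exact code or the candidate range that was tried):</p>
        <p>
```python
import random

def dist(a, b):
    # Euclidean distance between two coordinate vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, iters=20, seed=0):
    # Plain Lloyd's algorithm: repeat nearest-centroid assignment
    # and centroid recomputation for a fixed number of rounds.
    centroids = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return labels

def silhouette(points, labels):
    # Mean silhouette coefficient: (b - a) / max(a, b) per point, where
    # a is the mean intra-cluster distance and b the smallest mean
    # distance to any other cluster; singleton clusters score 0.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own if q is not p) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in grp) / len(grp)
                for m, grp in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def pick_k(points, candidates):
    # Keep the number of clusters with the highest mean silhouette.
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))
```
</p>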
        <p>The instances are therefore grouped at two levels: first by time, then by the cluster ID previously
identified. Various new features are computed as part of the aggregation (e.g., the average
“delay” of the buses in seconds) for each identified clustered monitoring area. Multiple training
and test sets were created as illustrated in Figure 11. The i-th evaluation step uses i hours for
training, and the (i + 1)-th hour for testing. 10% of the available test windows are perturbed
by randomly selecting 3 columns for each instance and randomly assigning a new value to
each selected feature. These test windows are considered anomalous. The remaining 90% of the
available test windows are used without perturbation and considered non-anomalous for the
evaluation. The aim of this setting is to perform an evaluation based on landmark windows.
The best value for the parameter t was selected through internal cross-validation on
the training instances in the interval [0, 15].</p>
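        <p>The perturbation protocol for building anomalous test windows can be sketched as follows (hypothetical helpers; the exact perturbation code is not published in the paper):</p>
        <p>
```python
import random

def perturb_window(window, n_cols=3, seed=0):
    # Corrupt a test window: for each instance, pick n_cols random
    # feature indexes and overwrite them with random values drawn
    # around the feature's observed range.
    rng = random.Random(seed)
    n_features = len(window[0])
    lo = [min(row[j] for row in window) for j in range(n_features)]
    hi = [max(row[j] for row in window) for j in range(n_features)]
    perturbed = []
    for row in window:
        row = list(row)
        for j in rng.sample(range(n_features), n_cols):
            row[j] = rng.uniform(lo[j] - 1.0, hi[j] + 1.0)
        perturbed.append(row)
    return perturbed

def split_windows(windows, anomaly_fraction=0.1, seed=0):
    # Mark 10% of the test windows as anomalous and perturb them;
    # the remaining 90% are kept unchanged as normal windows.
    rng = random.Random(seed)
    n_anom = max(1, int(len(windows) * anomaly_fraction))
    anom_idx = set(rng.sample(range(len(windows)), n_anom))
    return [(perturb_window(w, seed=i) if i in anom_idx else w, i in anom_idx)
            for i, w in enumerate(windows)]
```
</p>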
        <p>In Figure 12, hour-by-hour histograms are reported for the first day. Stacked green bars
indicate the correct predictions, while the red ones indicate the wrong predictions. A red date
label indicates that the window is perturbed (anomaly). The top label contains the total number
of instances in the test set. During normal windows, the anomaly detector is effective,
since false positives are generally avoided. Most of the normal scenarios that occurred during
different time slots (in the morning, afternoon, evening, and night) were recognized as normal
situations: 99.7% accuracy (we have only 5 false positives at the beginning, when the model is
still unstable). From the figure, we can also see that the system produces many false negatives at
the beginning [01:40-03:40]. This is expected, since the model is still too unstable to detect anomalies.
Moreover, the lack of data, due to the lack of public transport late at night (or early in the
morning, only 48 instances), further complicated the problem. During the day, after 22 hours of
training, the anomaly detector appears to be much more stable and capable of predicting most of
the anomalies that occurred during the two-hour anomalous time slot [16:40-18:40] in the afternoon.
After 26 hours of training, the anomaly detector becomes more stable still and capable of predicting
most of the anomalies that occurred during the anomalous time slot [20:40-21:40] in the evening.
After 28 hours of training, the anomaly detector becomes even more stable and capable of
predicting most of the anomalies that occurred during the anomalous time slot [05:40-06:40] in the
evening/early morning. In Table 1, we report the overall quantitative results, which confirm
that the algorithm, given sufficient training data, achieves very high prediction scores,
with very high precision.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>
        In this paper, we tackled the task of anomaly detection. For this purpose, we extended the
algorithm SparkGHSOM, originally designed for the clustering task, in order to address the
task at hand. Furthermore, the main algorithm has been made more explainable by providing
the reasons for each detected anomaly in the form of an instance-based feature ranking. The
results show the effectiveness of the proposed approach both qualitatively and quantitatively
in real application scenarios. For future work, we aim to perform further and more robust
experiments to better evaluate the predictive quality, the explainability, and the
scalability of this new extended version of SparkGHSOM. From an architectural viewpoint,
we aim to provide anomaly detection as an additional service according to a model-based
approach for Big Data Analytics-as-a-service [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
We acknowledge the project IMPETUS (Intelligent Management of Processes, Ethics and
Technology for Urban Safety) that receives funding from the European Union’s Horizon 2020 research
and innovation programme under grant agreement No. 883286.
https://cordis.europa.eu/project/id/883286. Dr. Paolo Mignone acknowledges the support of Apulia Region through the REFIN
project “Metodi per l’ottimizzazione delle reti di distribuzione di energia e per la pianificazione
di interventi manutentivi ed evolutivi” (CUP H94I20000410008, Grant n. 7EDD092A).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Malondkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Corizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kiringa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Japkowicz</surname>
          </string-name>
          ,
          <article-title>Spark-ghsom: Growing hierarchical self-organizing map for large scale mixed attribute datasets</article-title>
          ,
          <source>Information Sciences</source>
          (
          <year>2018</year>
          ). doi:10.1016/j.ins.2018.12.007.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Stojanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Appice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Džeroski</surname>
          </string-name>
          ,
          <article-title>Network regression with predictive clustering trees</article-title>
          ,
          <source>Data Mining and Knowledge Discovery</source>
          <volume>25</volume>
          (
          <year>2012</year>
          )
          <fpage>378</fpage>
          -
          <lpage>413</lpage>
          . doi:10.1007/s10618-012-0278-6.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Gonçalves</surname>
          </string-name>
          <string-name>
            <surname>Jr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Barros</surname>
          </string-name>
          ,
          <article-title>Rcd: A recurring concept drift framework</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          <volume>34</volume>
          (
          <year>2013</year>
          )
          <fpage>1018</fpage>
          -
          <lpage>1025</lpage>
          . doi:10.1016/j.patrec.2013.02.005.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kohonen</surname>
          </string-name>
          ,
          <article-title>The self-organizing map</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>78</volume>
          (
          <year>1990</year>
          )
          <fpage>1464</fpage>
          -
          <lpage>1480</lpage>
          . doi:10.1109/5.58325.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <article-title>Generalizing self-organizing map for categorical data</article-title>
          ,
          <source>IEEE Transactions on Neural Networks</source>
          <volume>17</volume>
          (
          <year>2006</year>
          )
          <fpage>294</fpage>
          -
          <lpage>304</lpage>
          . doi:10.1109/TNN.2005.863415.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Kader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perry</surname>
          </string-name>
          ,
          <article-title>Variability for categorical variables</article-title>
          ,
          <source>Journal of Statistics Education</source>
          <volume>15</volume>
          (
          <year>2007</year>
          . doi:10.1080/10691898.2007.11889465.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lloyd</surname>
          </string-name>
          ,
          <article-title>Least squares quantization in PCM</article-title>
          ,
          <source>IEEE Transactions on Information Theory</source>
          <volume>28</volume>
          (
          <year>1982</year>
          )
          <fpage>129</fpage>
          -
          <lpage>137</lpage>
          . doi:10.1109/TIT.1982.1056489.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          ,
          <article-title>Silhouettes: A graphical aid to the interpretation and validation of cluster analysis</article-title>
          ,
          <source>Journal of Computational and Applied Mathematics</source>
          <volume>20</volume>
          (
          <year>1987</year>
          )
          <fpage>53</fpage>
          -
          <lpage>65</lpage>
          . doi:10.1016/0377-0427(87)90125-7.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Redavid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Corizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          ,
          <article-title>An OWL ontology for supporting semantic services in big data platforms</article-title>
          ,
          <source>in: 2018 IEEE International Congress on Big Data (BigData Congress)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>228</fpage>
          -
          <lpage>231</lpage>
          . doi:10.1109/BigDataCongress.2018.00039.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>