Anomaly Detection for Physical Threat Intelligence

Paolo Mignone 1,2, Donato Malerba 1,2 and Michelangelo Ceci 1,2
1 Department of Computer Science, University of Bari Aldo Moro, Via Orabona 4, 70125 Bari, Italy
2 Big Data Lab, National Interuniversity Consortium for Informatics (CINI), Via Ariosto 25, 00185 Rome, Italy

Abstract
Anomaly detection is a machine learning task that has been investigated within diverse research areas and application domains. In this paper, we perform anomaly detection for Physical Threat Intelligence. Specifically, we perform anomaly detection for air pollution and public transport traffic analysis for the city of Oslo, Norway. To this aim, the state-of-the-art method SparkGHSOM was considered to learn predictive models for normal (i.e., regular) scenarios of air quality and traffic jams in a distributed fashion. Furthermore, we extended the main algorithm to make the detected anomalies explainable through an instance-based feature ranking approach. The results show that SparkGHSOM is able to detect anomalies in both real applications considered in this study, despite having been designed for different tasks.

Keywords
Anomaly detection, Air pollution, Public transport traffic

1. Introduction

Anomaly detection is a machine learning task that refers to the problem of identifying data that do not conform to patterns observed in historical data. These patterns represent the expected behaviour in normal conditions. Therefore, anomaly detection is usually performed through a data-driven algorithm that constructs a model able to detect a specific measurement/object/instance/observation as anomalous with respect to the historical data already seen.
Anomaly detection is a very general task that finds applications in many real-world scenarios, such as fraud detection for credit cards, insurance, or health care; intrusion detection for cyber-security; fault detection in safety-critical systems; and military surveillance for enemy activities [1]. In this paper, we consider the anomaly detection task for the purposes of Physical Threat Intelligence. Specifically, we propose an algorithm for anomaly detection which works on data continuously collected by geo-located sensors in urban areas. The data refer to physical information (e.g., temperature, number of vehicles crossing a gate, number of pedestrians in a given area, PM10 level at certain points in the town, etc.). The goal is to identify an anomalous, unexpected behaviour for one or many values simultaneously, considering the specific time, date and spatial coordinates of the considered observation. This would give Security Operators the opportunity to understand potentially dangerous situations and take the appropriate actions in time. The task we consider hereby is particularly challenging since data generated by sensors are big in size and have spatial and temporal coordinates that make the data not independent.

ITADATA2022: The 1st Italian Conference on Big Data and Data Science, September 20-21, 2022, Milan, Italy
paolo.mignone@uniba.it (P. Mignone); donato.malerba@uniba.it (D. Malerba); michelangelo.ceci@uniba.it (M. Ceci)
http://www.di.uniba.it/~mignone/ (P. Mignone); http://www.di.uniba.it/~malerba/ (D. Malerba); http://www.di.uniba.it/~ceci/ (M. Ceci)
ORCID: 0000-0002-8641-7880 (P. Mignone); 0000-0001-8432-4608 (D. Malerba); 0000-0002-6690-7583 (M. Ceci)
Indeed, the spatial proximity of sensors introduces spatial autocorrelation in functional annotations and violates the usual assumption that observations are independently and identically distributed (i.i.d.). Although the explicit consideration of these spatial dependencies brings additional complexity to the learning process, it generally leads to increased accuracy of the learned models [2]. In addition, data generated by sensors are also affected by temporal autocorrelation, since they: i) tend to have similar values at the same time on close days; ii) have a cyclic and seasonal (over days and years) behavior; iii) tend to show the same trend over time. While stream mining algorithms deal with both i) and ii), they may fail to consider iii), since they tend to better represent the most recently observed concepts, forgetting previously learned ones [3]. On the contrary, time series-based approaches are able to deal with iii), but may fail to consider i) and ii). In fact, they typically require the size of the temporal horizon as an input: considering a short-term horizon (e.g., daily) excludes a long-term horizon (e.g., seasonal) and vice versa. In the approach presented in this paper, we instead propose a time-series approach that exploits both spatial and temporal features, in order to take into account all the aspects mentioned before. In particular, the method addresses the problem of identifying complex spatio-temporal patterns in sensor data by means of Self-Organizing Maps (SOMs). A SOM [4] is a neural-network-based clustering algorithm that operates by mapping high-dimensional input data into a 2-dimensional space implemented by a grid of neurons called a feature map. In this paper, we consider GHSOMs (Growing Hierarchical SOMs), which are particularly suitable for time series data and better capture spatio-temporal information, thanks to a hierarchical organization of the SOMs that better adapts to complex data distributions.
Specifically, we consider the distributed extension Spark-GHSOM [1], which exploits the Spark architecture to process massive data, like those coming from sensors. Since GHSOMs are designed for clustering and not for anomaly detection tasks, we extend the learning algorithm Spark-GHSOM in order to learn GHSOMs for anomaly detection, in an unsupervised fashion.

2. Spark-GHSOM

Spark-GHSOM [1] was introduced to overcome two limitations of classical GHSOMs. Indeed, a GHSOM i) requires multiple iterations over the input dataset, making it intractable on large datasets; and ii) is designed to handle datasets with numeric attributes only, an important limitation since most modern real-world datasets are characterized by mixed attributes (numerical and categorical). Therefore, Spark-GHSOM exploits the Spark platform to process massive datasets in a distributed fashion. Furthermore, it exploits the distance hierarchy [5] to modify the optimization function of GHSOM so that it can (also) coherently handle mixed-attribute datasets. Spark-GHSOM showed high accuracy, scalability, and descriptive power on different datasets. The first step in the GHSOM algorithm is to compute the inherent dissimilarity in input data with different types of attributes. Classical GHSOMs exploit the mean quantization error. However, this error is suitable for numerical attributes only. Since there is no standard definition of mean for categorical attributes, SparkGHSOM replaces the mean quantization error with the variance in order to assess the quality of the map and neurons. For categorical attributes, unlikability is a good measure to estimate how often the values differ from one another [6].
Formally, let 𝔻 be the dataset under analysis. The unlikability of a categorical attribute A of 𝔻 is defined as:

    U(A) = Ξ£_{i ∈ domain(A)} p_i (1 - p_i)    (1)

where p_i = frequency(A_i, 𝔻) / |𝔻|, A_i is the i-th value of the attribute A, and frequency(A_i, 𝔻) is the absolute frequency of the value A_i for the attribute A in 𝔻. Therefore, SparkGHSOM computes the overall variance of the dataset as follows:

    Οƒ = Ξ£_{A ∈ featureset} [ 1_num(A) Β· Οƒ(A) + 1_cat(A) Β· U(A)/2 ]    (2)

where 1_num(A) (resp. 1_cat(A)) is 1 when the attribute A is numerical (resp. categorical), and 0 otherwise, and Οƒ(A) represents the classical variance of the attribute A when it is numerical.

The distance hierarchy [5] is considered to compute the similarities among categorical values. To compute the distance among categorical values, a distance hierarchy for each categorical attribute must be provided in advance. Values that are similar according to the concept hierarchy are placed under a common parent, which represents an abstract concept. The GHSOM training process takes mixed attributes into account and consists in finding the winner (closest) neuron of the SOM w.r.t. the single input instance according to the distance hierarchy. In the first step, the winner neuron is identified for the input instance according to the distance hierarchy, and the neuron's weight vector is then modified by a certain amount to match the instance vector. In the hierarchy tree of concepts, where the leaves represent the actual values of the instances and the non-leaf nodes represent the neurons, this process pulls the neuron point towards its leaf in order to "specialize" what the neuron describes. In the second step, the winner neuron and its surrounding neighbor neurons of the SOM are adapted by moving them towards the input instance.
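As a minimal sketch of Equations (1) and (2), the following Python functions compute the unlikability of a categorical attribute and the overall dataset variance over a mix of numerical and categorical columns (the function names and the column-list representation are our own, not taken from the Spark-GHSOM implementation):

```python
from collections import Counter

def unlikability(values):
    """Unlikability of a categorical attribute (Eq. 1): sum_i p_i * (1 - p_i),
    where p_i is the relative frequency of the i-th category."""
    n = len(values)
    counts = Counter(values)
    return sum((c / n) * (1 - c / n) for c in counts.values())

def variance(values):
    """Classical (population) variance for a numerical attribute."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def overall_variance(columns, is_numeric):
    """Overall dataset variance (Eq. 2): variance for numeric columns,
    unlikability / 2 for categorical ones, summed over all attributes."""
    total = 0.0
    for col, numeric in zip(columns, is_numeric):
        total += variance(col) if numeric else unlikability(col) / 2
    return total
```

For instance, a balanced binary attribute has unlikability 2 Β· 0.5 Β· 0.5 = 0.5, its maximum for two categories.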
This training process requires a defined number of training epochs over the input dataset. The training is governed by the Mean Quantization Error (MQE) of a neuron, that is, the total deviation of the neuron from its mapped input instances. The MQE of a SOM layer is computed as the average MQE of all the neurons representing instances. A higher value of the MQE means that the layer does not represent the input data well and requires more neurons to better represent the input domain. Moreover, when a single neuron still does not represent its surrounding instances well, the neuron is expanded into a SOM hierarchically (see Figure 1).

Figure 1: A Growing Hierarchical Self Organizing Map.

3. Spark-GHSOM for Anomaly Detection

The training process of Spark-GHSOM follows the classical GHSOM training process, except for the use of a different function for the calculation of the distance between the input vector and the neurons of the feature map, since the Euclidean distance is not computable on categorical attributes. For this reason, the hierarchical distance was chosen [1, 5]. The hierarchy obtained can thus be used to solve an anomaly detection task. In particular, when a new input vector is supplied to the hierarchy, the algorithm looks for the SOM that best approximates the input data (that is, the SOM with the shortest distance with respect to the input vector). Once found, it is used to carry out the prediction for the new input data, based on the distance between the input vector and the closest neuron (the winner neuron) in the map.
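The hierarchical search described above can be sketched as a recursive descent: at each level, find the winner neuron and, if it was expanded into a child SOM, continue the search there. The dict-based SOM layout below is a hypothetical simplification of our own, used only to illustrate the idea on numerical data with the Euclidean distance:

```python
import numpy as np

def winner(som, x):
    """Index and distance of the neuron of `som` closest to x.
    `som` is a dict with a 'neurons' weight matrix and an optional
    'children' map from neuron index to a child SOM (hypothetical layout)."""
    d = np.linalg.norm(np.asarray(som["neurons"], dtype=float)
                       - np.asarray(x, dtype=float), axis=1)
    i = int(d.argmin())
    return i, float(d[i])

def descend(som, x):
    """Follow the hierarchy: pick the winner neuron at each level and, if it
    was expanded into a child SOM, continue there. Returns the distance
    between x and the final winner neuron."""
    i, dist = winner(som, x)
    child = som.get("children", {}).get(i)
    return descend(child, x) if child is not None else dist
```

The distance returned by `descend` is the quantity compared against the anomaly threshold in the next section.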
More formally, let π‘₯𝑖 be the new example to be considered, and let 𝑒(π‘₯𝑖 ) = arg min𝑒 𝑑𝑖𝑠𝑑(π‘₯𝑖 , 𝑒) the closest neuron to π‘₯𝑖 according to the distance measure described before, the example is considered an anomaly if the following inequality holds: 𝑑𝑖𝑠𝑑(π‘₯𝑖 , 𝑒(π‘₯𝑖 )) > (π‘‘π‘Žπ‘£π‘” + 𝑑𝑓 βˆ— 𝜎 ) (3) In the formula, π‘‘π‘Žπ‘£π‘” is the average distance among the training instances and the neurons of the model after the training, 𝜎 the standard deviation of such distances, and 𝑑𝑓 the user-defined threshold. As data distributions tend to change over time, it may be necessary to update the knowledge of the anomaly detector using more recent data. For this reason, Spark-GHSOM for anomaly detection provides the possibility to update the weights vectors of the neurons while keeping the generated hierarchy unchanged. This process can be particularly useful if end users do not have enough time or data availability to train a new anomaly detector from scratch. Consequently, having a pre-trained model already available, it is possible to provide the model with a micro- batch of data, in order to update the knowledge extracted by the model and adapt it to the user’s needs. This aspect is particularly useful in our case, where data generated by the sensors can be relatively few. The anomaly detector could produce different types of output depending on the level of detail. The simplest approach provides feedback for the current data in the form of a Boolean response. This kind of output could support raising an alert if the response is equal to β€œanomaly”. This approach presents the advantage that is simple to handle and transmits the prediction as a binary variable (e.g., anomaly/normal, 0/1, true/false). Its drawback is that it makes it difficult for the end-user to interpret the raised alert/anomaly. 
Therefore, a more informative approach can be obtained by combining the previous one with a ranking of the variables (feature ranking) according to their importance, indicating each variable's contribution to the anomaly. The feature ranking orders the entire set of features composing the data collection with respect to feature importance. Feature importance is a numerical value between 0 and 1 expressing how anomalous the value of the feature is with respect to the data collection, such that the sum of all feature importance values in the ranking is equal to 1. The importance score is derived from a distance function between the current data under analysis and the winner neuron. Specifically, the importance is proportional to the contribution provided by the single feature to the Euclidean distance between x_i and u(x_i). More formally, the ranking function for the instance x_i and feature index l, rf(x_i)[l], is computed as follows:

    rf(x_i)[l] = (x_i[l] - u(x_i)[l])Β² / Ξ£_{l'} (x_i[l'] - u(x_i)[l'])Β²    (4)

This approach helps to identify the feature(s) that most contributed to the anomaly and, therefore, the "reason" for the anomaly.

4. Experiments

The experiments were conducted for the city of Oslo (Norway) by considering two real domains: air pollution and public transport traffic.

Air pollution analysis

The proposed method was tested using data coming from air quality monitoring sensors to identify pollutant concentrations deemed abnormal. The considered locations within the city of Oslo are shown in Figure 2.
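Equation (4) amounts to each feature's share of the squared Euclidean distance, as in this minimal sketch (the uniform fallback for identical vectors is our own assumption, since Eq. 4 is undefined when the distance is zero):

```python
import numpy as np

def feature_ranking(x, winner):
    """Per-feature importance (Eq. 4): each feature's share of the squared
    Euclidean distance between the instance and its winner neuron.
    Scores lie in [0, 1] and sum to 1."""
    sq = (np.asarray(x, dtype=float) - np.asarray(winner, dtype=float)) ** 2
    total = sq.sum()
    if total == 0:
        # identical vectors: Eq. 4 is undefined, fall back to uniform scores
        return np.full(len(sq), 1.0 / len(sq))
    return sq / total
```

Sorting the features by these scores in decreasing order yields the ranking reported to the end user.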
At each location, different pollutants are monitored by the sensors:

β€’ Hjortnes: NO, NO2, NOx, PM10 and PM2.5
β€’ Loallmenningen: NO, NO2, NOx, PM1, PM10 and PM2.5
β€’ Spikersuppa: PM10 and PM2.5

The information on the concentration of pollutants comes with both a timestamp and the geo-coordinates (latitude and longitude), so that the time series can be reconstructed. The data are publicly available and can be downloaded through a REST API (https://api.nilu.no/).

Figure 2: Location of the considered stations in the city of Oslo.

Figure 3: Concentrations per hour of NO, NOx and NO2 pollutants during October 2021, from Hjortnes station.

The period considered for training was from January 2021 to September 2021, with an hourly sampling rate, totalling 18,286 data points from the chosen locations. The period considered for testing was October 2021, totalling 720 acquisitions from the chosen locations. The best value for the parameter tf was selected according to internal cross-validation on the training instances in the interval [0, 15]. Figure 3 shows the concentrations per hour of NO, NOx, and NO2 pollutants during the test period, i.e., October 2021, from the Hjortnes station. These pollutants were chosen because they appear within the top-3 of the feature ranking for the time instants considered anomalous by the algorithm, indicated with black arrows in the graph. It is worth noting that we did not find an abnormal situation on October 21 at 10 a.m., indicated with a green arrow in Figure 4, when very high concentrations of PM1 were recorded, even though at this time point the pollutant PM1 is correctly present in the first position of the feature ranking. The motivation is that several pollutants are observed together, and a sudden increase in the concentration of one of them is sometimes not sufficient to classify the time instant as a potential abnormal situation.
Figure 5 shows the concentration per hour of the PM1 pollutant during the test period, from Loallmenningen. For this place, PM1 is the most decisive pollutant for the detection of the abnormal situations that occurred during October 2021. As in the previous graphs, the black arrows indicate the time instants in which we detected abnormal concentrations of the considered pollutants. As expected, the algorithm was able to correctly detect high concentrations of the PM1 pollutant. On October 26 at 9 p.m., as indicated by the green arrow, the concentrations of PM1 were very similar to those of October 27 at 4 p.m.; however, only in the latter case was an anomalous situation found by the algorithm. A more detailed graph is shown in Figure 6. The reason is a sudden increase in the concentrations of the remaining pollutants, which occurred on October 27 at 4 p.m. This situation, as shown in Figure 7, allowed the algorithm to identify an anomalous situation at this timestamp. Figure 8 shows the concentrations per hour of PM10 and PM2.5 pollutants during the test period, from the area of Spikersuppa. The pollutants shown in the graph are the only ones the station can monitor. As expected, the algorithm did not identify any situations deemed abnormal for this place, as the concentrations in October are quite regular.

Figure 4: Concentrations per hour of PM10 and PM2.5 pollutants during October 2021, from Hjortnes station.

Figure 5: Concentrations per hour of PM1 pollutant during October 2021, from Loallmenningen station.

Figure 6: A zoom-in of the time interval for the PM1 pollutant during October 2021, from Loallmenningen station.

Figure 7: Concentrations per hour of NO, NO2, PM10 and PM2.5 pollutants during October 2021, from Loallmenningen station.

Figure 8: Concentrations per hour of PM10 and PM2.5 pollutants during October 2021, from Spikersuppa station.

Public transport traffic

This dataset consists of one week of data regarding Oslo's public transport.
The instances represent GPS-tracked buses with latitude and longitude. Each instance is timestamped according to the ISO 8601 standard with a resolution of seconds. The Service Interface for Real-time Information - Vehicle Monitoring (SIRI-VM) is used to model vehicle movements and their progress compared to a planned timetable 2. For this dataset, the processing pipeline illustrated in Figure 9 was executed. Starting from the week of data on Oslo public transport, we performed data cleaning in order to fix some encoding issues. We also aggregated data by 5-minute intervals and by spatial areas according to a preliminary clustering. This step was crucial since the data provided refer to movable points on the map, which would make the aggregation operations unfeasible. Clustering on the spatial locations was performed with the K-Means algorithm [7]. The variables of the considered data were extended with the cluster identifier (cluster ID) and the cluster centroid's latitude and longitude. Since the K-Means algorithm needs the number of clusters as input, we performed the well-known silhouette cluster analysis [8] in order to identify the number of areas for monitoring the traffic. According to the silhouette analysis, we considered 100 different regions for traffic monitoring (see Figure 10). The instances are therefore grouped at two levels: first by time, then by the previously identified cluster ID. Various new features are computed as part of the aggregation (e.g., the average "delay" of the buses in seconds) for each identified monitoring area. Multiple training and test sets were created as illustrated in Figure 11. The n-th evaluation step uses n hours for training and the (n+1)-th hour for testing. 10% of the available test windows are perturbed by randomly selecting 3 columns for each instance and randomly assigning a new value to each selected feature.

Figure 9: The data processing pipeline.
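The silhouette-based choice of the number of monitoring areas can be sketched with scikit-learn: fit K-Means for each candidate number of clusters and keep the best-scoring one (the function name and candidate range below are illustrative, not the paper's actual configuration, which selected 100 regions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(points, candidates):
    """Pick the number of monitoring areas by silhouette analysis:
    fit K-Means for each candidate k and keep the best-scoring model."""
    best = None
    for k in candidates:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
        score = silhouette_score(points, km.labels_)
        if best is None or score > best[0]:
            best = (score, k, km)
    _, k, km = best
    return k, km
```

The fitted model then provides, for each instance, the cluster ID (`km.predict`) and the centroid coordinates (`km.cluster_centers_`) that extend the feature set.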
These test windows are considered anomalous. The remaining 90% of the available test windows are used without perturbation and considered non-anomalous for the evaluation. The aim of this setting is to perform an evaluation based on landmark windows. The best value for the parameter tf was selected according to internal cross-validation on the training instances in the interval [0, 15].

In Figure 12, hour-by-hour histograms are reported for the first day. Stacked green bars indicate correct predictions, while red ones indicate wrong predictions. Red text in the date indicates that the window is perturbed (anomalous). The top label contains the total number of instances in the test set. During normal windows, the anomaly detector is effective, since false positives are generally avoided. Most of the normal scenarios that occurred during different time slots (in the morning, afternoon, evening, and night) were recognized as normal situations: 99.7% accuracy (we have only 5 false positives at the beginning, when the model is still unstable). From the figure, we can also see that the system produces many false negatives at the beginning [01:40-03:40]. This is expected, since the model is still too unstable to detect anomalies. Moreover, the lack of data, due to the lack of public transport late in the night (or early in the morning, only 48 instances), further complicated the problem. During the day, after 22 hours of training, the anomaly detector appears to be much more stable and capable of predicting most of the anomalies that occurred during the two-hour anomalous time slot [16:40-18:40] in the afternoon.

2 https://api.entur.io/realtime/v1/rest/vm?datasetId=RUT

Figure 10: The Oslo street map and the best locations for monitoring traffic according to the clustering step.

Figure 11: Training and testing sets.

Figure 12: Hour-by-hour histograms indicating true positives, true negatives, false positives, and false negatives for the first day of data.
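The window-perturbation scheme described above (10% of windows perturbed, 3 random columns per instance) can be sketched as follows; the function name, the random-value range, and the list-of-matrices window representation are our own assumptions, as the paper does not specify them:

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb_windows(windows, frac=0.1, n_cols=3):
    """Randomly mark a fraction `frac` of the test windows as anomalous by
    assigning random values to `n_cols` randomly chosen columns of each
    instance. Returns the windows and their ground-truth labels (1 = anomaly)."""
    out, labels = [], []
    for w in windows:
        w = np.array(w, dtype=float)
        if rng.random() < frac:
            for row in w:
                cols = rng.choice(w.shape[1], size=n_cols, replace=False)
                # assumption: new values drawn uniformly around the row's range
                row[cols] = rng.uniform(row.min(), row.max() + 1.0, size=n_cols)
            labels.append(1)  # perturbed window: anomalous ground truth
        else:
            labels.append(0)  # untouched window: normal ground truth
        out.append(w)
    return out, labels
```

The resulting labels give the ground truth against which accuracy, precision, recall, and f1-score are computed in the landmark-window evaluation.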
Table 1
Oslo public transport traffic: quantitative results in terms of accuracy, precision, recall, and f1-score.

                                    accuracy    precision    recall    f1-score
    Oslo public transport traffic   91.33%      99.77%       85.74%    92.22%

After 26 hours of training, the anomaly detector becomes even more stable and capable of predicting most of the anomalies that occurred during the anomalous time slot [20:40-21:40] in the evening. After 28 hours of training, it becomes more stable still and capable of predicting most of the anomalies that occurred during the anomalous time slot [05:40-06:40] in the early morning. In Table 1, we report the overall quantitative results, which confirm that the algorithm, given sufficient training data, achieves very high prediction scores, with very high precision.

5. Conclusions

In this paper, we tackled the task of anomaly detection. For this purpose, we extended the algorithm SparkGHSOM, originally designed for the clustering task, in order to address the task at hand. Furthermore, the main algorithm was made more explainable by providing the reasons for each detected anomaly in the form of an instance-based feature ranking. The results show the effectiveness of the proposed approach, both qualitatively and quantitatively, in real application scenarios. For future work, we aim to perform further and more robust experiments, with the aim of better evaluating the predictive quality, the explainability, and the scalability of this new extended version of SparkGHSOM. From an architectural viewpoint, we aim to provide anomaly detection as an additional service, according to a model-based approach for Big Data Analytics-as-a-service [9].

Acknowledgment

We acknowledge the project IMPETUS (Intelligent Management of Processes, Ethics and Technology for Urban Safety), which receives funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 883286 (https://cordis.europa.eu/project/id/883286). Dr.
Paolo Mignone acknowledges the support of Apulia Region through the REFIN project "Metodi per l'ottimizzazione delle reti di distribuzione di energia e per la pianificazione di interventi manutentivi ed evolutivi" ("Methods for the optimization of energy distribution networks and for the planning of maintenance and evolutionary interventions") (CUP H94I20000410008, Grant n. 7EDD092A).

References

[1] A. Malondkar, R. Corizzo, I. Kiringa, M. Ceci, N. Japkowicz, Spark-GHSOM: Growing hierarchical self-organizing map for large scale mixed attribute datasets, Information Sciences (2018). doi:10.1016/j.ins.2018.12.007.
[2] D. Stojanova, M. Ceci, A. Appice, S. DΕΎeroski, Network regression with predictive clustering trees, Data Mining and Knowledge Discovery 25 (2012) 378-413. doi:10.1007/s10618-012-0278-6.
[3] P. M. GonΓ§alves Jr, R. S. Barros, RCD: A recurring concept drift framework, Pattern Recognition Letters 34 (2013) 1018-1025. doi:10.1016/j.patrec.2013.02.005.
[4] T. Kohonen, The self-organizing map, Proceedings of the IEEE 78 (1990) 1464-1480. doi:10.1109/5.58325.
[5] C.-C. Hsu, Generalizing self-organizing map for categorical data, IEEE Transactions on Neural Networks 17 (2006) 294-304. doi:10.1109/TNN.2005.863415.
[6] G. D. Kader, M. Perry, Variability for categorical variables, Journal of Statistics Education 15 (2007). doi:10.1080/10691898.2007.11889465.
[7] S. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory 28 (1982) 129-137. doi:10.1109/TIT.1982.1056489.
[8] P. J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987) 53-65. doi:10.1016/0377-0427(87)90125-7.
[9] D. Redavid, R. Corizzo, D. Malerba, An OWL ontology for supporting semantic services in big data platforms, in: 2018 IEEE International Congress on Big Data (BigData Congress), 2018, pp. 228-231. doi:10.1109/BigDataCongress.2018.00039.