Developing a Decision Support System with a Georeferenced Smart City Security Index (SCSI): A Case Study of Messina

Giuseppe Accardo1,*,†, Roberta Marino1,*,† and Valentina Esposito1,*,†

1 Data Jam srl, Centro Direzionale Isola F8, Via F. Lauria, Naples, 80143, Italy

Abstract
With the rapid growth of the urban population, cities face increasing challenges in terms of mobility, sustainability, and living conditions. Smart cities leverage advanced technologies to improve urban efficiency and citizens' quality of life. This work aims to empower the Public Administration (PA) of Messina, a medium-sized Italian city, with a georeferenced Smart City Security Index (SCSI) to monitor urban security and inform decision-making processes. To achieve this, we trained a Random Forest Regressor using open data alongside territory-specific key performance indicators (KPIs) and insecurity indicators. The model assigns a security score from 0 to 100 to each city area, achieving a Root Mean Squared Error (RMSE) of 5.6 on the test set. Furthermore, integrating the model with a Decision Support System (DSS) allows PA members to assess how the SCSI changes in response to adjustments made to the input factors, supporting decision-making.

Keywords
smart city, open data, decision support system

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author.
† These authors contributed equally.
gi.accardo@almaviva.it; r.marino@almaviva.it; v.esposito@almaviva.it
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

1. Introduction

This work aims to leverage Artificial Intelligence (AI) to develop a specific smart city index for monitoring urban security in Messina, ultimately contributing to a smarter city.

The concept of a "smart city" encompasses the integration of technology and urban planning to enhance a city's sustainability, efficiency, and innovation. Several Smart City Indices (SCIs) have been developed in the literature to assess and quantify these aspects. These indices typically consider a range of services and projects that contribute to a city's "smartness," encompassing areas like public safety (e.g., reduced traffic accidents) and environmental sustainability.

SCIs function by aggregating multiple variables and indicators into a single score, providing a statistical summary of a city's overall performance. Monitoring this score over time allows for evaluation of a city's progress in achieving its "smart city" goals. Table 1 summarizes some of the most widely recognized SCIs from the literature.

Table 1: Smart city indices in the literature.

Index                                   KPI
Arcadis Sustainable Cities Index [1]    20 indicators
Innovation Cities Index [2]             162 indicators
ISO 37120 [3]                           100 indicators
ITU FG-SSC [4]                          88 indicators
Networked Society City Index [5]        35 indicators
Siemens Green City Index [6]            30 indicators

AI, on the other hand, has become a crucial tool for researchers in smart city initiatives. This, coupled with the open data movement, has spurred further research using these sophisticated techniques to unlock the potential of data in realizing smart city goals. There is some evidence of positive impacts in the transportation, sustainability, and security fields [7][8][9][10][11][12].

This work aims to equip the Public Administration (PA) of Messina with a tool for monitoring urban security and informing decision-making processes. This tool leverages a georeferenced and machine learning-based Smart City Security Index (SCSI).

2. Materials and Methods

This section details the data sources utilized for this study. We describe the steps involved in constructing the variables employed by the machine learning (ML) model. Additionally, we present an overview of the exploratory analyses conducted to gain insights into the characteristics of the dataset.

The city is subdivided into 287 spatial units (tiles), each encompassing an area of 1 km². The SCSI is used to assess the security level of each tile over time. It follows that each record in the dataset must adhere to a specific structure, identified by a unique triad: geometry_id, month, and year. The year and month fields represent the reference time, while the geometry_id field uniquely identifies a tile.

2.1. Open data

We utilized open data from the city of Messina, described below.

Municipal Police measures gather data on accidents involving traffic violations. As an initial data preprocessing step, we addressed missing geospatial coordinates. We leveraged the Nominatim open-source API [13] to geocode these locations using the information provided in the "Luogo Incidente" (incident location) text column. Prior to geocoding, the text data underwent cleaning procedures using natural language processing (NLP) techniques. This process successfully assigned geographic coordinates to 84% of the previously unknown locations. Next, we extracted the variables of interest by aggregating the data by geometry_id, year and month, based on the violated articles of the Italian traffic regulation. This resulted in the following features:

• “prov_precedenza” (precedence): sum of incidents with violations of Articles 145 and 150.
• “prov_velocita” (speed): considers only violations of Article 141.
• “prov_posizione” (position): sum of violations of Articles 154, 149, 143, 148 and 144.
• “prov_documenti” (documents): sum of violations of Articles 80, 193, 116, 180, 126, 94 and 93.
• “prov_sosta” (stop): sum of violations of Articles 158 and 157.
• “prov_segnaletica” (signals): sum of incidents with violations of Articles 40, 41 and 146.

Lighting Points. As with the Municipal Police measures data, we addressed missing geospatial coordinates within the Lighting Points data. We employed the Nominatim open-source API for geocoding, using the information provided in the "Ubicazione toponomastica" (toponomastic location) text column. As with the previous data source, text cleaning with NLP techniques was necessary prior to geocoding. This process successfully assigned geographic coordinates to 78% of the locations where coordinates were previously missing. Next, the feature of interest, namely the number of public lighting poles present in a given tile at the reference time (“n_pali_luce”), was calculated by counting the poles whose geospatial coordinates fall within the analyzed tile.

Urban Video surveillance details the closed-circuit television (CCTV) system operating within the Municipality. The data concern only administration-owned cameras, all of which are georeferenced, and have no missing values. Here, the variable of interest is the number of cameras present in a given tile at the reference time (“n_telecamere”). We obtained this value by counting the CCTV cameras that fall within the analyzed tile, based on their geospatial coordinates.

2.2. Digital exhaust data

For the construction of the features, in addition to the open data, we derived the following geolocated indicators that characterize tiles in the city of Messina.

The “sentiment” index is a measure of sentiment calculated on online content from the analysis period within the selected tile. It ranges from 0 to 100.

The “footfall” score is an absolute, unbounded index that measures the foot traffic and popularity of a tile. This indicator considers various factors, such as the number of geolocated reviews, content on social media, and aggregated and anonymized data originating from mobile devices.

The remaining features, "degrado" (degradation), "incendio" (arson), "incidente" (accident), and "crimini" (crimes), count the number of events linked to each of these categories per tile, year, and month. We collected this information by web scraping from open and licensed/authorized closed sources such as websites, blogs, social media and police sources.

2.3. Data Preparation

After integrating the data described in the previous sections into a single table, we obtained a dataset with 12628 records, each representing a unique triad of geometry_id, month, and year. The dataset covers the time frame from January 2019 to August 2022, extremes included.

We then analyzed the content of this dataset, focusing initially on the target variable for the machine learning model, namely the "Security_Target". This variable is a weighted average of a qualitative and a quantitative index, representing the security level of each tile. The qualitative index considers the sentiment of online reviews related to security falling within each tile, while the quantitative index reflects the number of crimes committed. The qualitative index is weighted by the number of reviews in each tile, normalized between 0 and 1, while the quantitative index has a constant weight of 1. Values of the target variable range from 0 (lowest security) to 100 (highest security).

Figure 1 illustrates that, for a specific month and year, the target variable often takes the value of 100, which corresponds to the highest security level. Furthermore, as shown in Figure 2, the distribution of the target variable over the entire dataset exhibits a significant imbalance, with the value 100 being the most frequent by a considerable margin.

Figure 1: “Security_Target” distribution in Messina. This figure depicts the spatial distribution of the target. Color intensity is used to represent the “Security_Target” value, with light yellow indicating areas with the highest security level and dark red indicating areas with the lowest security level.

To further explore the distribution of the target variable, we visualized it after excluding tiles with the highest security level (value 100). As shown in Figure 3, the remaining values exhibited a wider range, suggesting a more informative distribution for analysis. Nevertheless, it was necessary to consider how to correct the imbalance in the values assumed by the target.

To understand the cause of this imbalance, we examined the features associated with records having the highest “Security_Target” (value 100). We discovered that 7812 records possessed identical features. In all these cases, the feature values were either 0 (indicating no events, for instance no arson) or NaN (meaning that data on factors like footfall and sentiment were unavailable). Because these features were missing or non-informative, we opted to remove these duplicate rows. We obtained a dataset with 4816 records, 3398 of which had a target of 100.

Following the initial data exploration, we analyzed the prevalence of missing values across all features (percentages shown in Table 2). To address this issue, we excluded observations where both sentiment and footfall data were missing. This exclusion step resulted in a dataset of 4654 records.

Table 2: Percentage of missing values.

Feature             Percentage of missing values
prov_precedenza     0
prov_velocita       0
prov_posizione      0
prov_documenti      0
prov_sosta          0
prov_segnaletica    0
n_telecamere        0
n_pali_luce         0
sentiment           3.36
footfall            3.36
degrado             0
incendio            0
incidente           0
crimini             0

Subsequently, the data was split into training and test sets. The training set comprised 3257 records, while the test set contained 1397 records.

3. Results

This section details the ML model selected to compute the SCSI: a random forest regressor from the scikit-learn library, whose hyperparameters are listed in Table 3.
Figure 2: “Security_Target” histogram.

Figure 3: Histogram of “Security_Target” values less than 100.

Table 3: Hyperparameters for the Random Forest Regressor.

Hyperparameter       Value
n_estimators         100
oob_score            True
criterion            'squared_error'
max_depth            None
random_state         0
max_features         None
min_samples_split    6

We assessed the goodness of the ML model by analyzing its performance metrics in Table 4, the residuals on the test set in Table 5, and the distribution of observed and predicted values in Figure 4.

Table 4: Performance metrics for the Random Forest Regressor, namely MAE (Mean Absolute Error), MSE (Mean Squared Error), and RMSE (Root Mean Squared Error). The validation errors are the mean of the errors calculated during 5-fold cross-validation.

Measure    Train    Validation (mean)    Test
MAE        1.01     2.08                 1.78
MSE        9.76     40.11                31.09
RMSE       3.12     6.28                 5.58

Table 5: Distribution of observed values, predicted values and residuals in the test set. Residuals are the difference between observed and predicted values.

Value    Observed    Predicted    Residual
count    1397        1397         1397
min      0           0            -40.99
25%      97.04       97.39        0
50%      100         99.95        0
75%      100         100          0.39
max      100         100          80.04

Having established the validity of the chosen model, we analyzed the impact of each feature on the target variable. SHapley Additive exPlanations (SHAP) values provide a useful graphical representation of these feature importances [14]. A beeswarm plot visualizes the distribution of SHAP values, highlighting the features that exert the strongest influence on the model's predictions.

Our analysis in Figure 5 reveals that the "degrado" feature has the greatest impact. High values of "degrado" (represented by red in the beeswarm plot) are associated with a lower SCSI, and vice versa. The "n_pali_luce" feature is the second most important, with lower values corresponding to a reduced SCSI. This analysis of feature importance provides key insights into the behavior of the decision support system (DSS).

Following model development, we equipped the Public Administration of Messina with a DSS that lets its members simulate how the SCSI changes when features are modified within selected city tiles (see Figure 6 and Figure 7). In essence, these features function as controllable parameters that can be adjusted to improve the security level in specific areas.

Building on a similar approach, we developed a georeferenced green index (GI) for the PA of Messina (see equation (1)). This index assigns a score between 0 and 100, quantifying the overall quality and quantity of urban green space for each spatial unit. Similar to the SCSI, the green index is designed for integration with a DSS (see Figure 8 and Figure 9). However, unlike the SCSI, it does not employ machine learning techniques.

The GI is calculated as follows:

$$
GI(\text{tile}) = \frac{w_1 \cdot UG + w_2 \cdot \left( \dfrac{HGA + TCA \cdot \alpha}{ELA} \cdot 100 \right)}{w_1 + w_2} \tag{1}
$$

Explanation of variables:

1. UG (Urban green perception index): reflects the perceived quality and user experience of urban green spaces, derived from analyzing online reviews.
2. HGA (Horizontal green area, m²): represents the area of gardens, parks, and forests within the spatial unit.
3. TCA (Tree canopy area, m²): calculated as the sum of canopy area for all trees in the spatial unit.
4. ELA (Emerged land area, m²): represents the total land area excluding water bodies within the spatial unit.
5. α (Weight relative to the vegetative state of the canopy area): derived from Visual Tree Assessment (VTA) data. It is calculated as the weighted sum of the areas of tree crowns within a tile, adjusted for their vegetative state, divided by the total area of all tree crowns in the tile.
6. w1 and w2: weights assigned such that the quantitative dimension (HGA and TCA) contributes twice as much as the qualitative dimension (UG) to the overall GI score.
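Equation (1) and the definition of α can be made concrete with a small worked example. The sketch below is a minimal illustration under stated assumptions: the names TileGreenData, canopy_alpha and green_index, the per-tree state weights, the sample numbers, and the choice w1 = 1, w2 = 2 (one way to make the quantitative dimension count twice as much as the qualitative one) are hypothetical, and UG is assumed to lie on a 0 to 100 scale.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class TileGreenData:
    """Inputs of equation (1) for one tile (hypothetical container)."""
    ug: float       # UG, assumed to be on a 0-100 scale
    hga_m2: float   # HGA, horizontal green area in square metres
    tca_m2: float   # TCA, tree canopy area in square metres
    ela_m2: float   # ELA, emerged land area in square metres
    alpha: float    # vegetative-state weight of the canopy area


def canopy_alpha(crowns: Iterable[Tuple[float, float]]) -> float:
    """Area-weighted mean of per-tree vegetative-state weights.

    `crowns` holds (crown_area_m2, state_weight) pairs; the state weights
    are ASSUMED to be derived from VTA classes and to lie in [0, 1].
    """
    crowns = list(crowns)
    total_area = sum(area for area, _ in crowns)
    if total_area <= 0:
        raise ValueError("total crown area must be positive")
    return sum(area * weight for area, weight in crowns) / total_area


def green_index(tile: TileGreenData, w1: float = 1.0, w2: float = 2.0) -> float:
    """Evaluate equation (1); w2 = 2 * w1 is an assumed weighting."""
    if tile.ela_m2 <= 0:
        raise ValueError("emerged land area must be positive")
    quantitative = (tile.hga_m2 + tile.tca_m2 * tile.alpha) / tile.ela_m2 * 100.0
    return (w1 * tile.ug + w2 * quantitative) / (w1 + w2)


# Hypothetical 1 km² tile: 10 ha of parks, 5 ha of canopy, alpha = 0.8, UG = 60.
example_tile = TileGreenData(
    ug=60.0, hga_m2=100_000.0, tca_m2=50_000.0, ela_m2=1_000_000.0, alpha=0.8
)
gi = green_index(example_tile)
print(f"GI = {gi:.2f}")  # about 29.33
```

In this example the quantitative term is (100000 + 50000 · 0.8) / 1000000 · 100 = 14, so GI = (1 · 60 + 2 · 14) / 3 ≈ 29.33.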
Overall, this project demonstrates the value of data-driven approaches in urban planning. The SCSI and DSS empower the PA to make informed decisions regarding security, and the future integration of machine learning into the Green Index holds further promise for comprehensive urban management.

Figure 4: Distribution of observed and predicted values in the test set.

Figure 5: The "beeswarm" graph for the Random Forest regression related to the Smart City Security Index.

Figure 6: Example of an implementation of the SCSI in the Municipality of Messina. Empty tiles indicate areas with missing data for footfall and sentiment and the remaining features equal to 0.

Figure 7: Example of DSS application (security).

Figure 8: Example of the GI implemented in the Municipality of Messina. Empty tiles represent areas with missing data for the municipal tree inventory. The number of tiles displayed will increase as the census continues.

Figure 9: Example of DSS application (urban green condition).

References

[1] Arcadis. (2022, June 21). The Arcadis Sustainable Cities Index 2022. [Member Spotlight]. Retrieved from https://www.arcadis.com/en/knowledge-hub/perspectives/global/sustainable-cities-index
[2] 2thinknow. (2023). Innovation Cities™ Index. Retrieved from https://innovation-cities.com/worlds-most-innovative-cities-2022-2023-city-rankings/26453/
[3] International Organization for Standardization. (2018). ISO 37120:2018 Sustainable development of communities - Indicators for city services and quality of life.
[4] International Telecommunication Union (ITU). (n.d.). The Telecommunication Standardization Sector (ITU-T). Retrieved from https://www.itu.int/en/ITU-T/Pages/default.aspx
[5] Ericsson. (n.d.). Networked Society City Index. Retrieved from https://www.ericsson.com/en/reports-and-papers/networked-society-insights
[6] Siemens AG. (n.d.). Siemens Green City Index.
Retrieved from https://assets.new.siemens.com/siemens/assets/api/uuid:cf26889b-3254-4dcb-bc50-fef7e99cb3c7/gci-report-summary.pdf
[7] Agarwal, P. K., Gurjar, J., Agarwal, A. K., & Birla, R. (2015). Application of artificial intelligence for development of intelligent transport system in smart cities. Journal of Traffic and Transportation Engineering, 1(1), 20-30.
[8] Bharadiya, J. (2023). Artificial intelligence in transportation systems a critical review. American Journal of Computing and Engineering, 6(1), 34-45.
[9] De Las Heras, A., Luque-Sendra, A., & Zamora-Polo, F. (2020). Machine learning technologies for sustainability in smart cities in the post-covid era. Sustainability, 12(22), 9320.
[10] Hassan, S. I., & Agarwal, P. (2020). Analytical approach to sustainable smart city using IoT and machine learning. In Big Data, IoT, and Machine Learning (pp. 277-294). CRC Press.
[11] Lourenço, V., Mann, P., Guimaraes, A., Paes, A., & de Oliveira, D. (2018, July). Towards safer (smart) cities: Discovering urban crime patterns using logic-based relational machine learning. In 2018 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). IEEE.
[12] Butt, U. M., Letchmunan, S., Hassan, F. H., Ali, M., Baqir, A., Koh, T. W., & Sherazi, H. H. R. (2021). Spatio-temporal crime predictions by leveraging artificial intelligence for citizens security in smart cities. IEEE Access, 9, 47516-47529.
[13] OpenStreetMap contributors. (2023). Nominatim. OpenStreetMap wiki. Retrieved from https://nominatim.openstreetmap.org/
[14] Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17) (pp. 4768–4777). Curran Associates Inc.