Challenges for Air Pollution Monitoring: a Cyber-Physical Social Systems Approach Marco Zappatore Sergio Refolo Antonella Longo Hesplora s.r.l. & University of Dept. of Innovation Engineering Dept. of Innovation Engineering Salento, Lecce, Italy University of Salento University of Salento marco.zappatore@hesplora.it, Lecce, Italy Lecce, Italy marcosalvatore.zappatore sergio.refolo@studenti.unisalento.it antonella.longo@unisalento.it @unisalento.it ABSTRACT Similarly, citizens must be made aware of air quality status and how to be more actively involved in actions aimed at improving Air pollution control plays a pivotal role today in urban contexts, daily life quality conditions. as both citizens and public administrators are increasingly The attention on environmental issues is continuously increasing sensitive about it. Traditional air pollution sensing is performed and it involves more and more people: this results in a rising and managed by public institutions with professional and number of in-domain researches. However, people hardly can take expensive equipment, thus exhibiting a series of inherent any conclusion on the topic by themselves and usually the correct limitations such as isolated monitoring campaigns, data interpretation of research findings is troublesome for non- heterogeneity, inconsistency and incompleteness, limited access to professional recipients [3]. In the most common scenario (usually sensed data. Cyber-Physical-Social Systems (CPSS) promise to be known as institutional monitoring), professional and expensive a considerable step forward, as they promote the systematic sensors are placed in the close proximity of few significant areas involvement of citizens in monitoring processes and the (e.g., airports, hospitals, congested roads, etc.) by authorized provisioning of proactive services to end users. However, several agencies or public bodies devoted to environmental control on elements hinders such a model. 
In this paper, we will discuss the national, regional or even smaller scale. Raw data are collected challenges of applying cyber-physical paradigm to air pollution and published online by the same agencies. This approach is monitoring in smart cities, exemplifying the issues on the Italian worldwide adopted and falls under the definition of air quality case study and then we will show how CPSS will go over them assessment [4]–[8]. Published data come directly from sensors and outline novel research directions. (i.e., raw data) or from simple data manipulation processes and usually no inferred knowledge is provided in a simple and KEYWORDS effective way, especially when data sources and data formats Cyber-Physical Social Systems, Air Monitoring, Data Processing, differ significantly. It is, therefore, widely accepted that dedicated Data Visualization, Mobile Crowd Sensing. data processing solutions are needed in order to clean data from unwanted noise, thus focusing on what really matters [9], [10]. This implies the need of monitoring outcomes effectively 1 Introduction presented to final users, in order to provide meaningful insights to Air quality monitoring is a strategic and long-term activity that the different involved actors, as citizens’ needs differ from those gives experts the opportunity to make evaluations about air exhibited by city administrators. pollution, to study emission causes and sources as well as to Cyber-Physical-Social Systems (CPSSs) promise to be a valuable develop corrective or mitigation plans. The air quality status cast solution for urban monitoring scenarios as they leverage on the several concerns amongst experts as well as citizens due to its availability of scores of heterogeneous sensors whose readings are related health risks [1], [2]. 
Therefore, it is evident how air collected, aggregated and analyzed by cyber processes and pollution control is necessary to prevent human diseases and to profitably merged to real-time, city-related data provided and protect ecosystems. That is why it must be addressed by local shared by complementary social sources in order to be presented authorities and policy makers, as well as it should be a as relevant information to citizens and authorities [11]. However, responsibility for the stakeholders in the industrial sector. this paradigm is far to be applied on a large scale. In this paper we will focus on the Italian scenario, by examining the existing 1st Workshop on Cyber-Physical Social Systems (CPSS2019). solution and by proposing a first step towards the adoption of October 22, 2019, Bilbao, Spain. CPSSs. Currently, the Italian situation features traditional air Copyright © 2019 for this paper by its authors. Use permitted under pollution assessments based on local sensing stations that, even if Creative Commons License Attribution 4.0 International (CC BY 4.0). reliable and properly manned, do not guarantee a wide coverage of monitoring campaigns (due to high costs and lack of skilled personnel) and expose several data heterogeneity issues, thus 38 Challenges for Air Pollution Monitoring: CPSS2019, October 22, 2019, Bilbao, Spain a Cyber-Physical Social Systems Approach making difficult any data comparison and aggregation on a wider 2 Air Pollution: the Current Scenario scale. For such a reason, in this paper we will thoroughly examine air 2.1 Legislation in Europe pollution monitoring data provided by Italian regional agencies Air pollution is an important environmental and societal issue that for environmental protection. A proper data model will be devised impacts on human health, ecosystems and climate changes. in order to aggregate data coherently. 
Data manipulation pipelines Several official reports have addressed so far this topic, trying to will be applied to collected data in order to aggregate and to propose regulations to be applied on large scale. For instance, the visualize them properly with the help of business intelligence 2016 report of air quality in Europe [12] focuses on the scenario tools. This procedure highlights data incompleteness and in the EU Member States. It shows that a large portion of the heterogeneity coming from institutional sources. For partly European population (as well as the ecosystems in the same overcoming the issue we propose a CPSS that is currently under region) is exposed to air pollution levels that exceed European development in the framework of an Italian regional research standards and World Health Organization (WHO) Air Quality project, aimed at large-scale, low-cost urban environmental Guidelines (AQGs). pollution monitoring. The most significant provenance of air pollutants is represented The paper is structured as follows: section 2 introduces the by anthropogenic sources. They encompass transportation domain of our investigation and the corresponding research systems, industry, power plants, agriculture machineries and questions. In section 2 the addressed scenario is described in household appliances. detail. Section 3 deals with Cyber-Physical-Social Systems. Regardless of their origin, air pollutants can be divided into two Section 4 describes the addressed scenario while Section 5 shows main categories: primary and secondary ones. Primary pollutants our data analysis approach. Achieved results are discussed in are directly released into the environment from the processes that Section 6, along with the proposed CPSS modelling. Finally, generate them. The main pollutants belonging to this class (e.g. Section 7 draws conclusions. CO, NOx, SOx) are the result of combustion processes. 
Secondary pollutants derive from primary ones, and are obtained Table 1: European legislation about emissions Pollutants* Policies SO2 NO2, NOx, BaP / PM O3 , CO Heavy metals VOCs NH3 PAH SOx Directives 2008/50/EC (EU, SO2 PM O3 NO2, NOx CO Pb Benzene regulating 2008) ambient 2004/107/EC (EU, As, Cd, Hg, Ni BaP air quality 2004) (EU) 2015/2193 SO2 PM NOx (EU, 2015) 2001/81/EC (EU, SO2 NMVO NOx, NH3 2001) C Cd, Tl, Hg, Sb, 2010/75/EU (EU, SO2 PM NOx, NH3 CO As, Pb, Cr, Co, VOC Directives 2010a) Cu, Mn, Ni, V regulating European standards VOC, emissions on road vehicle PM NOx CO NMVO of air emissions C pollutants 2012/46/EU (EU, PM NOx CO HC 2012) 94/63/EC (EU, VOC 1994) 2009/126/EC (EU, VOC 2009c) *Pollutants: PM: Fine particles; O3: Ozone; NO2: Nitrogen dioxide; NOx: Nitrogen oxides; NH3: Ammonia; SO2: Sulphur dioxide; SOx: Sulphur oxides; CO: Carbon monoxide; CO: Carbon monoxide; BaP: Benzo[a]pyrene; PAH: Polycylcic Aromatic Hydrocarbon; VOC: Volatile Organic Compound; NMVOC: Non-Methane VOC; HC: Hydrocarbons; Pb: Lead, As: Arsenic; Cd: Cadmium; Co: Cobalt; Cr: Chromium; Cu: Copper; Hg: Mercury; Mn: Manganese; Ni: Nickel; Sb: Antimony; Tl: Thallium; V: Vanadium. 39 CPSS2019, October 22, 2019, Bilbao, Spain M. Zappatore, S. Refolo and A. Longo from their transformation due to reactions usually involving decree D.Lgs.155/2010, defines how to evaluate and manage air oxygen and light: oxidation is therefore a phenomenon strictly quality for human health defense and environment protection. correlated to this pollutant’s category. In Table 2 we summarize the currently enforced D.Lgs.155/2010: More specifically, PM (particulate matter), BaP (benzo[a]pyrene) it presents pollutant concentration, reference averaging period, and mercury (Hg) emissions come from the incomplete legal nature of the specific norm enlisted, permitted exceedances combustion of various fuels, while emissions of ammonia (NH3) per year and limit values for each pollutant. 
or CH4 (methane) from agriculture. The current trend about PM In Italy air quality monitoring is decentralized and performed foresees threshold exceedances even in 2020: PM with a diameter autonomously by regional or local agencies for environmental of about 10 µm (henceforth, PM10) exceeds the EU limit value in protection: each agency deals only with its own territory. 21 of the 28 EU Member States, while PM 2.5 (i.e., particles These agencies, named ARPA (whose acronym stands for whose diameter is nearly 2.5 µm) exceeds on average in 4 states Regional Agency for Environmental Protection, in Italian) are [13]. public institutions that provide technical support to Italian The transport sector and the industry have been taking a regional administrations (except for Trentino-Alto Adige, which considerable reduction of their emissions of air pollutants in has been split into the two autonomous provinces of Trento and Europe since 2000 (except for BaP and Cadmium, Cd, emissions Bolzano) to perform environmental control and enforce in transports, and CH4 and BaP in industry). The trend of regulations. commercial, institutional and households’ emissions is less These agencies, born in 1993 and nationally coordinated by SNPA positive, with a 3% increase in BaP from 2000 to 2014. Moreover, (The National System for Environmental Protection, in Italian), less significant reductions of air pollutants have been experienced are nationwide dedicated to yearly environmental quality in agriculture. assessments. On the one hand, the decentralization in local In Table 1 the most relevant European directives concerning air agencies implies detailed control over a relatively limited portion pollution are reported. of the national territory. 
On the other hand, however, this causes The main goal of monitoring campaigns is providing indicators to heterogeneity across the different regions due to the lack of shared define emissions trend; the following list collects the main data format and collection, management and publication policies. indicators used in national monitoring campaigns with the related As a consequence, even if the agencies apply the same reference directives [14], [15]: environmental control methodologies and comply with the same 1. Greenhouse gases (CO2, CH4, N2O) – Framework regulations, citizens experience different air-pollution-related Convention on climate change (1992) ratified with L 65 of monitoring services and tools depending on the agency they refer 15/01/94; Kyoto Protocol (1997) ratified with L 120 of to. Moreover, different regions present different levels of detail 01/06/02; CIPE resolution 19/12/02; D.Lgs. 51/08; D.Lgs. n. about information offered by their environmental agencies and 30 13/03/13 this makes difficult to compare directly data coming from 2. Acidifying substances (SOx, NOx, NH3) – Goteborg different locations. Protocol (1999); NEC (2001/81/CE) directive; D.Lgs. 171/04 This scenario does not facilitate the analysis of the overall Italian 3. Particulate – LCP 2001/80/CE directive; CE 715/2007 regulation; CE 595/2009 regulation pollution scenario. Indeed, it is not possible to carry out this task 4. Carbon monoxide (CO) – D.Lgs. n. 152 of 03/04/2006; properly without any technical knowledge needed to overcome the 97/68/CE directive; 98/77/CE directive technical issues briefly sketched above. The support of a software 5. Benzene (C6H6) – L 413 of 04/11/97 application capable to normalize and integrate different sources is, 6. Persistent organic pollutants (IPA) – Aarhus Protocol (1998); at present, fundamental in order to make readable and L 125/06 understandable huge amount of available data merged from 7. 
Heavy metals – Aarhus Protocol (1998) several monitoring agencies. This, in addition to the possible presence of supporting and complementary data sources provided Humans can be adversely affected by exposure to air pollutants in by citizens, would be the ideal scenario for the implementation of ambient air. In response, the European Union has produced an the CPSS paradigm. However, such a scenario is still far to come. extensive body of legislation which establishes health-based Table 2: D.Lgs. 155/2010 standards and objectives for several air pollutants. These C* objectives are developed over different periods because pollutants P* Tavg* TVED*/LVED* AE* [µg/m³] impact human health in different ways according to exposure time TVED: 1.1.2010 (we refer the interested reader to the existing-legislation section PM 2.5 25 1Y⁑ n/a LVED: 1.1.2015 related to air quality in the EC Web portal [16]). 350 1h LVED: 1.1.2005 24 SO2 2.2 Legislation and Environmental Control Agencies in 125 24h LVED: 1.1.2005 3 the Italian scenario 200 1h LVED: 1.1.2010 18 In this paper our analysis is focused on the Italian situation: the NO2 2008/50/CE directive, implemented in Italy with the legislative 40 1Y LVED: 1.1.2010 n/a PM10 50 24h LVED: 1.1.2005 35 40 Challenges for Air Pollution Monitoring: CPSS2019, October 22, 2019, Bilbao, Spain a Cyber-Physical Social Systems Approach 40 1Y LVED: 1.1.2005 n/a monitoring stations for long time periods (at least 6 months). These stations are sometimes relocated to other sites, due to their Pb 0.5 1Y LVED: 1.1.2005⁂ n/a limited number. Large amounts of collected raw data are made Max openly available as daily or annual datasets in (semi-structured) CO 0,010 LVED: 1.1.2005 n/a 8h text formats such as .csv, .xls(x) or .json. 
Benzene 5 1Y LVED: 1.1.2010 n/a Data heterogeneities affect the Italian scenario as well: regional Max 25d/ environment control agencies do not share a common data Ozone 120 TVED: 1.1.2010 publication format and do not comply with a unified template for 8h 3Y *P: Pollutant name; C: Pollutant concentration; Tavg: Averaging period; publishing data. Each agency publishes validated data on a TVED: Target Value Enforcement date; LVED: Limit Value Enforcement daily/weekly basis on its own Web portal but adopts different data Date; AE: Permitted exceedances each year. visualization strategies and offers a variable set of tools for data ⁑ : Y: Year; h: Hour; d: Day; Max 8h: Maximum daily 8 hour mean. manipulation, ranging from simple data filtering to customized ⁂ : or 1.1.2010 in the immediate vicinity of specific, notified industrial chart composition. Data granularity is inconsistent as well, as in sources; 1.0 µg/m³ limit value applied from 1.1.2005 to 31.12.2009. some cases users can access single-day datasets while larger datasets are available in other cases, thus determining critical gaps 2.3 Monitoring Networks and Data Availability in user experience. Air monitoring is a long-term activity and it requires necessarily The lack of a common standard hinders the chance of joint careful studies. Usually, a monitoring network (i.e., a set of analysis: inconsistency between data formats, data structure or monitoring stations positioned in places of interest which provides detection metrics affect research potentials and limit non- some measures) is required. Monitoring stations record data about professionals from acquiring environmental awareness. 
pollutants concentration in the lower atmosphere: through specific However, as pointed out throughout the text, the most significant tools they perform measurements summarized in indicators, which issue affecting the Italian scenario is represented by the absence of are useful to make comparisons with limit values defined by an institutional unified platform allowing users to access, navigate directives and to know whether the situation is safe or not. and manage monitoring data on a national scale. In [17] the EU scenario in terms of air quality monitoring is From a legislative perspective, a federal council of Italian regional reported: monitoring campaigns are usually performed all year environment control agencies has been established in 2016 and a long with urban/local or regional scope. Monitoring stations are national air information system (SINAnet) [22] has been categorized into traffic, urban industrial or rural industrial established. However the council only promotes administrative locations. While there is a substantial homogeneity in these cooperation amongst agencies and the national information aspects amongst EU countries, data availability and data reporting system is not open to the public yet. Indeed, at the moment of differ significantly amongst Member States. As for data writing this paper, the system is accessible only by authorized availability, the following categories can be identified: 1) personnel from regional agencies (i.e., ARPAs). validated data available for authorities only; validated data available for the public after a time delay (normally 1 day for data validation procedures); non-validated data available for the public 3 CPSSs for environmental monitoring in real-/near-time. Data reporting is also variegated: in some Cyber-Physical Social Systems (CPSS) are rooted into Cyber- countries it is not performed on a nationwide scale, in some Physical Systems (CPS) and Cyber-Social Systems (CSS) [11]. 
others, instead, annual reports are published by environment Therefore, CPSS are made up of multiple layers of sensors and control agencies. actuators capable of monitoring physical phenomena and people’s However, data are sometimes incomplete and not certain. For actions and of cyber components capable of receiving sensor data instance, 15 EU Member States reported uncertainty in their and generate digital representations of the monitored world (i.e. emission estimations and, in 2014, nearly 33% of data was the digital twins), so that specific actions can be implemented incomplete [18], [19]. In this context, therefore, proper data accordingly. Sensing layers are usually populated by IoT (Internet cleaning and management operations become essential in order to of Things) sensors, mobile devices, and WSNs (Wireless Sensor make data usable and to minimize errors [20]. As a consequence, Networks) that provide time-referenced and geo-referenced existing approaches to air pollution monitoring leverage datasets. In addition to them, social data streams are managed, as significantly on big data and data mining solutions. well. Therefore, CPSS represent an evolution of IoT applications Several actions are underway in order to cope with this scenario, and are based on the integration of physical, cyber and social such as the Copernicus Atmosphere Monitoring Service (CAMS), spaces, so that new knowledge can be inferred and the interactions implemented by the EU Centre for Medium-Range Weather with humans can easier happen. The core idea is that Forecast (ECMWF) [21], aimed at reducing air pollution effects heterogeneous data sources from the physical world are fed to and the concentration of toxic breathable elements. 
data processing and analytics processes, thus enabling further data In Italy, monitoring campaigns are performed in sensitive fusion procedures whose output can be used by end-user locations (e.g., high-density traffic hotspots, airports, schools, applications, as described in the so-called data-oriented CPSS downtown areas, industrial sites, etc.) by positioning fixed functional architectural model [23], where a CPSS solution for a 41 CPSS2019, October 22, 2019, Bilbao, Spain M. Zappatore, S. Refolo and A. Longo urban scenario is described as a set of “data sourcing, collection have developed a solution for collecting, managing and and analysis mechanisms in order to obtain city intelligence”. visualizing data. We have analyzed a 5-year range (from 2013 to More specifically, in [23], the authors consider a CPSS as built on 2017) by referring to standardized pollutants only (i.e., C6H6, top of three core elements. The first one is represented by CO, NO2, O3, SO2, PM10, PM2.5). The following subsection collaborative sensing sources, operating according to multiple will deal with the dataset. sensing paradigms but sensing the same physical contexts. This element, therefore, not only consists of traditional WSNs and IoT 4.1 Referred Dataset nodes but also of “smartphone-carrying citizens” who become Initially, all Italian regions were considered for the analysis: this “valuable sensing resources”. The second core element is given allowed us to sketch the overall scenario and to identify by data analysis tools, needed in order to highlight any existing differences in the way regions perform the same task. The very spatial/temporal or content-related pattern (or correlation) first aspect is that, despite the availability of the federal council amongst datasets from different sources in order to increase and of SINAnet platform (see Section 3), the accessibility and context awareness. 
The third element is provided by cross-spatial availability of monitoring data vary depending on the region, thus data fusion tools, which are in charge of mining collected making troublesome to perform analyses and comparisons on a multimodal datasets and cope with heterogeneous measurement national scale. We used data available online via the regional scales, combination of quantitative variables and qualitative ARPA portals. classifications, etc. For this reason, data integration is crucial, in order to merge files Several CPSS solutions based on this model have been proposed from different sources and define a shared and common data in the recent years, addressing a wide range of applications. The format. studies that specifically tackled urban environmental monitoring At the starting point the count of overall data spanned across a can be clustered depending on the targeted application. For time window from 2010 to 2018 and amounted nearly 71M instance, the urban noise mapping problem has been addressed in records. The overall dataset exhibited a significant heterogeneity [24] by adopting a fixed and mobile sensing infrastructure, in terms of data granularity, format and structure. Therefore, for enriched via participatory sensors by users, but no data fusion the sake of this cases, we selected a subset of sources in order to solutions have been proposed. The air quality assessment has been skim raw data before cleaning and to keep only the most analyzed in [25], considering social data sources only (as the homogeneous ones. 
This decision consisted in selecting only adopted CPSS infrastructure was fed by tweets from citizens those regions that provide records referred to: 1) the five-year about perceived air pollution levels), and in [26], distributing period 2013-2017 (because other years had less available data): 2) sensors only across communities of people, rather than to a large fundamentals pollutants (i.e., the standardized ones: C6H6, CO, portion of citizens. Other CPSS approaches have been applied in NO2, O3, SO2, PM10, PM2.5). Moreover our analysis evaluates Santander, Spain [27], where large IoT networks were deployed only regions whose measurements have been collected in for environmental participatory sensing and car parking compliance with regulatory limits. For instance, in the case of management, but no advanced data processing and data fusion Lazio region, data were available in annual metrics, while solutions were proposed. pollutant metrics must be computed daily, according to In the following sections, we will talk about the case of Italy, regulations. This aspect poses a severe incompatibility among which has allowed us to identify the most significant challenges in sources having different record granularity. managing environmental monitoring data on national scale Such a preliminary record filtering operation has reduced hindering the adoption of a CPSS approach and, subsequently, we significantly, the size of the initial dataset, by moving from 71M will introduce a proposal for a CPSS platform dedicated to urban to 32M records. Selected regions are nine out of twenty: pollution control. Basilicata, Campania, Emilia-Romagna, Lombardia, Marche, Puglia, Sicilia, Toscana, Valle D’Aosta. 
4 Case Study 4.2 Data quality issues In order to identify current challenges in environmental According to the definition of data quality dimension clusters monitoring in Italy, as introduced in Section 1, we have defined a depicted in [28], the main issues faced in managing data coming nationwide case study about air pollution and, subsequently, we from regional agencies have been: 42 Challenges for Air Pollution Monitoring: CPSS2019, October 22, 2019, Bilbao, Spain a Cyber-Physical Social Systems Approach • Completeness: data format heterogeneity and sparsity in terms of time (i.e., regional datasets cover different time periods) and types (i.e., regional datasets refer to different subsets of pollutants). Therefore we have been forced to focus on 2013-2017 time period and on 7 standardized pollutants only. Moreover records outside the considered time range and not referring to the considered subset of standardized pollutants were discarded, totally or in part. For instance, regional datasets like those from Apulia and Lombardy regions were skimmed in order to keep only those subsets of data that complied with referenced intervals. A special concern was related to CO: for this pollutant, the majority of the analyzed regions have used an hourly average metric, while the regulation requires an 8-hour moving average instead: it has been decided to use for Figure 1: Dimensional Fact Model this pollutant a metric functional to datasets, namely an environmental pollution can significantly improve environmental hourly average metric. Therefore, CO records with 8- awareness across the population. 
For such reasons, we have hour moving average metric were removed through designed and implemented a platform capable of merging proper filtering heterogeneous data about air pollution, cleaning them and • Redundancy: some datasets included multiple versions visualizing them in a meaningful and effective way, by using a of the same type of record (i.e., same measurement with few dashboards. different type of metrics): therefore we had to select only one metric per record set (according to the 5.1 Data model corresponding regulation) in order to cope with data A specific data model is behind the tool, so that data processing redundancy. Datasets from Campania, Sicily and Valle and visualization tasks can be performed rigorously and D’Aosta were the ones affected by data redundancy the coherently. We refer to the Dimensional Fact Model (DFM), most. which has been proposed by Golfarelli et al. [30] specifically to • Accessibility (and the corresponding access time): some support data mart design. This conceptual representation consists regions provide datasets through the website of their of a set of fact schemata that basically model the analyzed domain environmental control agencies and other regions in terms of facts (i.e., any concept describing a time-evolving provide data via open data portals. In addition, for entity relevant to decision-making processes), dimensions (i.e., regions like Abruzzo, Liguria or Sardinia, it is very hard any qualitative description of a fact, composed by dimensional to compose a 1-year dataset that refers to all monitored attributes), measures (i.e., any numerical property or calculation pollutants. Indeed, it is possible either to download about a fact) and hierarchies (i.e., any directed tree made up of records on a daily basis, referring to all the pollutants or dimensional attribute). to download annual datasets referring to a single In our case study, the measurement fact is chosen as the most pollutant at a time. 
This leads to download scores of significant one (Figure 1). The combination of three dimensions different files manually. In some cases datasets were not (time, location and pollutant type) results in multiple potential accessible directly and data owners (i.e., the views of the same fact, so that it can be examined from multiple corresponding environment control agency) were perspectives. Several measures can be associated to the defined contacted with no official answer. fact (e.g., number of threshold exceedances for each parameter, average value sensed during a given time window for a given parameter in a given province, etc.), so that effective numerical 5 The Analysis Platform indicators can be then derived and implemented into visualization From the issues presented above non-professional users are dashboards. prevented to extract meaningful insights from such little- comparable [29] data without any technical help. To make this 5.3 Data Processing data effective researchers need platforms supporting the analysis Transformations of raw data are fundamental in order to reconcile without dedicating excessive time and computational resources to data provided by regional environment control agencies. data preparation and non-professional users must be supported Technically ETL pipeline has been developed using Pentaho even in accessing data and then guided across data visualization Community Edition, an open source ETL (Extraction, options, as publicly accessible and easy-to-understand data on Transformation and Loading) platform [31]. 43 CPSS2019, October 22, 2019, Bilbao, Spain M. Zappatore, S. Refolo and A. Longo Figure 2: Data visualization – Qlik Sense (summary sheet) Pentaho has been used for merging data sources and normalizing the corresponding datasets. 
Data normalization tasks have addressed data redundancy and data inconsistency amongst data sources and within the same source. In our case study, we specifically checked:
1) misspellings (e.g., station name and/or address, pollutant name, unit of measurement name, etc.),
2) data formats,
3) invalid values.
After this phase, regional datasets can be integrated into a shared destination format to which all the different sources must conform.

5.4 Data Visualization

After the ETL process, the dataset size was reduced to 27.58M records from the initial 32M. This dataset has been used as the input for data visualization. In order to achieve fast in-memory data loading and effective visualization options, we have adopted a widely used, freely available data analytics platform: Qlik Sense (Desktop Version) [32]. By using Qlik Sense, we have developed a set of dashboards dedicated to the different stakeholders for the examined case study (i.e., citizens, researchers, environment control agency personnel), with the aim of graphically and effectively explaining the analyses performed on the cleaned datasets. These dashboards are made up of several charts and filters. According to the Qlik Sense terminology, the developed solution is defined as a Qlik Sense app, while each thematic group of charts represents an app sheet.

Depending on the user role, different charts and views can be accessed. Overall, the developed Qlik Sense app consists of 8 sheets: the first one summarizes core details, while each of the remaining sheets presents a specific set of analyses (according to the DFM presented in Section 5.2) on one of the considered pollutants. Proper time-based and location-based filters have been implemented as well.

Let us now examine in more detail the sheets composing the app.

The first app sheet (represented in Figure 2) is a summary view of all the processed records, which counts them according to various criteria. This sheet is customizable thanks to several filters placed on the left side that allow the user to refine the visualization by time period (by year, by month) and by location (by region, by province). The pie chart on the left shows the overall distribution of detected pollutant types. For instance, it can be seen that NO2 accounts for 39.3% of the available sensor readings (i.e., nearly 10.84M). The map on the right shows the number of records per province according to a gradient color scale ranging from blue (fewer values) to dark red (more values). As can be seen, the province of Milan has the largest number of records for the considered 5-year time period (it is worth pointing out that Figure 2 shows the overall analysis with no filters applied).

The lower part of the sheet hosts, going from left to right, a horizontal bar chart where monitoring stations per province are counted (also in this case, the province of Milan, with 39 monitoring stations, has the highest number on a per-province basis), a line chart where the daily amount of recordings is reported, and an overall counter of the available data points in the considered dataset.

Each of the following sheets refers to a different pollutant. Figure 3 reports the one associated with NO2. These sheets are aimed at highlighting the relationships between detected values and regulated thresholds, in order to identify potential sources of concern. Each sheet is formatted as specified below.

Figure 3: Data visualization – Qlik Sense (pollutant detail sheet, NO2 case)

In the top left corner, two filters (by region, by year) are available. The speedometer on the right (i.e., the gauge-like chart) allows the user to compare the average detected value of the given pollutant type against its corresponding regulatory threshold (values beyond the limit are highlighted in red). The limit value is identified by a red line and explicitly mentioned in the footnote of the chart.

Moving towards the right, in the top section of the sheet, a line chart depicts the average value detected per province on a daily basis, with an explicit indication of the threshold exceedances. The chart is aimed at emphasizing existing differences amongst provinces. It is most effective when a single region is selected via the dedicated filter on the left, so that all the provinces in the same region can be compared, while with no region selected it may be slightly chaotic.

In the top right corner, a map is available, where all Italian provinces are outlined. A gradient color scale is used for depicting average values per province, ranging from light brown (lower values) to dark brown (higher values). This type of chart is very useful for a direct and effective comparison of values across different areas.

In the bottom left corner, a counter reports the number of readings for the pollutant under examination (i.e., the one the sheet is associated with) depending on the filtering options.

Proceeding towards the right, a vertical bar chart compares, on a daily basis, the average or maximum value (depending on the specific pollutant) against the corresponding limit value. In order to make the chart more effective, measurements are depicted in blue unless they exceed the threshold (in that case they are highlighted in red). Therefore, the proportion of measurements going beyond the threshold is immediately evident. A detail of this chart is reported in Figure 4. As can be seen, by filtering by time and by region, average values passing the threshold are clearly identifiable.

Figure 4: Detail of the vertical bar chart about the average value of the given pollutant (in this case, NO2). Filters by year (2017) and by region (Lombardy) are applied.

Finally, in the bottom right corner, a vertical bar chart is located. It is used for counting the number of measurements exceeding the corresponding limit value per year. Since Italian national regulations allow a given number of threshold exceedances per year per pollutant, this chart immediately shows whether that limit has been exceeded in a given year.

5.5 Discussion

Previous sections have highlighted the significant comparability issues in the regional environmental datasets. Table 3 shows the number of files available from regional websites (related only to the years and pollutants considered for the analysis) and the overall size of each set of files.

Table 3: Processed files and size per region

Region            No. of Files   Size
Basilicata        1295           135 MB
Campania          5              147 MB
Emilia-Romagna    868            191 MB
Lombardia         5              694 MB
Marche            10             26 MB
Puglia            1              42 MB
Sicilia           1              6 MB
Toscana           592            157 MB
Valle D’Aosta     5              5 MB
Total             2782           1.4 GB

The region with the largest number of files is Basilicata (1295), while the region whose files are the largest is Lombardy (694 MB). The last row shows that the total number of files used for this analysis is 2782, while the total size of all these files is 1.4 GB.

As for memory consumption, the full Qlik Sense app includes 8 sheets and 65 charts (interactive elements), for an overall memory occupancy of nearly 1.6 GB.

As for the performed data analyses, several useful insights have been achieved. The following list points out the most relevant ones for each pollutant.
1. PM10: Lombardy and Campania are the regions with the highest average value; moreover, PM10 is by far the pollutant with the greatest number of threshold exceedances.
2. PM2.5: Lombardy is still the region with the highest average value, with peaks in the Milano, Monza-Brianza and Cremona provinces. Overall, regions from Northern Italy have a higher average value than southern regions. This is due to the combination of weather conditions and vehicle density.
3. CO: Sicily is the region with the highest number of limit exceedances, while Campania is the region with the highest average value.
4. NO2: Apulia and Sicily are the regions with the highest average value; in addition, the Barletta-Andria-Trani, Bari, Taranto, Palermo and Catania provinces have an average value greater than the corresponding limit value.
5. O3: southern regions have a higher average value than northern regions, with the exception of Valle D’Aosta, which also has a high average value. Enna and Lecce are the provinces with the highest average value.
6. SO2: the situation is under control, since values are well below the allowed limit. Messina is the province with the highest average value.
7. C6H6: the situation is similar to the one found for SO2, if not even better. The only province with high values is Siracusa.

6 A Proposal for a CPSS Platform: APOLLON

The challenges that have emerged so far in managing data from Italian regional environment control agencies and the promising approach offered by CPSSs in this research area have motivated the research project named APOLLON, which targets the large-scale, mobile-mediated sensing of pollutants in urban contexts, according to the CPSS principles. More specifically, the APOLLON Project [3] is a research initiative funded by the Apulia Region (Italy) and aimed at designing, developing and deploying a platform for urban environmental monitoring in terms of noise and air. Several data streams are gathered from heterogeneous sources (e.g., citizen-owned personal devices, city-managed monitoring stations, etc.). The project novelty relies on: 1) integrating low-cost sensors deployed in urban areas; 2) involving citizens directly in monitoring campaigns according to citizen science principles; 3) sharing monitoring outcomes with city managers directly. One of the specific requirements of the platform is to build a monitoring network that integrates information flows gathered from sensors with other information sources, thanks to semantic technologies and geo-referential data analysis utilities, so that useful insights and high-level correlations can be achieved in near real time.

The architecture of the APOLLON system is organized into four layers (Figure 5). The IoT layer includes devices able to collect information on the environment (i.e., mobile and stationary environmental sensors). The data layer is devoted to processing, integrating and storing heterogeneous data sources (social data, sensors, climatic data, clinical data, open data, etc.). The business layer is a central processing layer that executes the business logic and communicates with the persistence level. Finally, the semantic Decision Support System (sDSS) represents the interface level between the system and the end user, and manages all services related to the interaction with the user (analyses, reporting, cartography, etc.). More specifically, the data layer is in charge of managing the acquired data according to specific ETL (Extraction, Transformation and Loading) procedures, by exploiting typical functionalities of Decision Support Systems and a microservice-based architecture.

The architecture briefly described so far is compliant with core CPSS principles (see Section 3). Moreover, the platform backend features a set of components specifically dedicated to data management and included in the so-called Hybrid Storage Layer (HSL), made up of five elements: the Data Management, the Data Processing and the Message Management block plus a Service Catalogue that indexes and exposes available services.

The HSL makes it possible to manage structured, semi-structured and unstructured data, and to manage all storage solutions provided in the APOLLON Data Lake (health data, open data, multidimensional data, user profile/community of interest data, semantic data, IoT sensor data, social data and urban geospatial data). The HSL contains relational, non-relational, multidimensional, and SFTP type storage systems. Specifically, we consider the following storage solutions:
• Staging area: temporary storage area for raw data collection, to be exposed for the processing and cleaning operations provided by the “Data Management” and “Data Processing” blocks;
• Health data storage: area for health data provided by local health authorities and the Ministry of Health Web portal (e.g., admissions for respiratory diseases, mortality data, etc.);
• Open data storage: area for the collection of the data streams coming from ARPA (i.e., Italian regional agency for the environmental protection) junction boxes and meteorological stations;
• User profile/community of interest storage: area for registering and managing users involved in the project;
• Multidimensional data storage: area for multidimensional analyses on collected data, to highlight any existing correlation;
• Semantic storage: area for ontologies and linked data;
• IoT sensor data storage: area for collecting the data streams coming from sensors;
• Social data storage: area required for the sentiment analysis phase on data coming from social networks;
• Urban Geospatial data storage: area aimed at hosting the thematic cartography related to pollutants and to weather and climate data.

Figure 5: APOLLON platform (logical architecture).

At the moment of writing this paper, the APOLLON platform deployment is under way and the first two pilot sites are providing the first datasets coming from citizens.
Preliminary assessments have clearly shown the potential of the proposed CPSS-based approach, in terms of platform scalability, learning potential for end users, involvement of end users, engagement of policy makers and city managers, and suitability for further integration with additional systems (such as analysis of population healthcare status). As for the integration of mobile sensed data with those of official statistics, crucial aspects are the specialization of completeness considering the representativeness, selectivity and sparsity aspects, the trustworthiness in the security quality dimension and the specialization of accuracy, consistency and redundancy aspects [28].

A thorough analysis of the effectiveness of the proposed CPSS-based approach will be performed in the upcoming months.

7 Conclusions

In this paper, a thorough analysis of the current Italian scenario in terms of available and comparable institutional datasets for air pollution monitoring has been performed. Starting from available data sources (i.e., datasets published by Italian regional environment control agencies), a series of shortcomings has been identified, ranging from data heterogeneity, inconsistency and incompleteness to significant limitations in accessing monitoring data.

A solution for collecting, processing, aggregating and visualizing air pollution datasets from a subset of Italian regions, referring to a five-year time period and to a subset of standardized pollutants, has been proposed. This approach highlighted data-related challenges to the adoption of Cyber-Physical Social Systems (CPSS) in this sector. Both these challenges and the analysis insights achievable in the visualization process have been presented in this paper. Moreover, starting from the elements identified during the design and implementation steps of the proposed solution, a regional CPSS addressing noise and air pollution monitoring has been devised, as the first step towards the adoption of CPSS for environmental monitoring nationwide. The CPSS platform, named APOLLON, has been described in Section 6.

In the near future, challenges related to the integration of big data and official statistics will be investigated in order to properly exploit the potential of mobile crowd sensing for urban environmental pollution monitoring. The main dimension clusters of data quality will be detailed and analyzed in the domain of big data from mobile crowd sensors, and approaches to effectively include people as data scientists will be described.

ACKNOWLEDGMENTS

This work was supported in part by the research project “APOLLON - environmentAl POLLution aNalyzer”, within the “Bando INNONETWORK 2017” funded by Regione Puglia (Italy) in the framework of the “FESR - Fondo Europeo di Sviluppo Regionale”.

REFERENCES

[1] WHO (World Health Organization), “Ambient Air Pollution: A Global Assessment of Exposure and Burden of Disease,” 2016.
[2] WHO (World Health Organization), “How air pollution is destroying our health,” 2019. [Online]. Available: https://www.who.int/air-pollution/news-and-events/how-air-pollution-is-destroying-our-health.
[3] The Center for Public Integrity, “Most EPA Pollution Estimates Are Unreliable, So Why Is Everyone Still Using Them?,” EcoWatch, 2018. [Online]. Available: https://www.ecowatch.com/epa-emission-factors-2529636639.html.
[4] WHO (World Health Organization), Monitoring ambient air quality for health impact assessment. 2002.
[5] European Union, “Guidance on Assessment under the EU Air Quality Directives,” 2005.
[6] IAQM (Institute of Air Quality Management), “A guide to the assessment of air quality impacts on designated nature conservation sites,” London, UK, 2019.
[7] P. Barn, P. Jackson, N. Suzuki, and T. Kosatsky, “Air Quality Assessment Tools: A Guide for Public Health Practitioners,” Vancouver, Canada, 2011.
[8] EPA (Environment Protection Authority) - South Australia, “Ambient Air Quality Assessment,” 2016.
[9] Italian Ministry for the Environment Land and Sea, “Environmental Challenges - Summary of the State of the Environment in Italy,” Rome, Italy, 2009.
[10] T. A. J. Kuhlbusch, “Challenges and the Future of Urban Air Quality Monitoring in Europe,” 2014.
[11] J. Zeng, L. T. Yang, M. Lin, H. Ning, and J. Ma, “A survey: Cyber-physical-social systems and their system-level design methodology,” Futur. Gener. Comput. Syst., 2016.
[12] EEA (European Environment Agency), “Air quality in Europe - 2018 Report,” Luxembourg, 2018.
[13] Down To Earth, “Air Pollution: PM Levels Continue to Exceed EU Limit in Large Parts of Europe,” 2016. [Online]. Available: https://www.downtoearth.org.in/news/air/air-pollution-pm-levels-continue-to-exceed-eu-limit-in-large-parts-of-europe-56427.
[14] ISPRA, “Qualità dell’Ambiente Urbano,” Rome, Italy, 2014.
[15] ISPRA (Istituto Superiore per la Protezione e la Ricerca Ambientale), “Annuario dei Dati Ambientali (Environmental Data Yearbook) 2018,” 2019.
[16] European Commission, “Air Quality - Existing Legislation,” Environment. [Online]. Available: https://ec.europa.eu/environment/air/quality/existing_leg.htm.
[17] EEA (European Environment Agency), “The Air Quality Monitoring Situation in Europe: State and Trends,” 2016. [Online]. Available: https://www.eea.europa.eu/publications/92-9167-058-8/page010.html.
[18] C. B. B. Guerreiro, V. Foltescu, and F. de Leeuw, “Air quality status and trends in Europe,” Atmos. Environ., vol. 98, pp. 376–384, 2014.
[19] EEA (European Environment Agency), “The air quality monitoring situation in Europe - State and trends,” 2016. [Online]. Available: https://www.eea.europa.eu/publications/92-9167-058-8/page010.html.
[20] S. Devarakonda, P. Sevusu, H. Liu, R. Liu, L. Iftode, and B. Nath, “Real-time air quality monitoring through mobile sensing in metropolitan areas,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, p. 8.
[21] ECMWF, “Monitoring Air Pollution Across Europe,” 2019. [Online]. Available: https://atmosphere.copernicus.eu/monitoring-air-pollution-across-europe.
[22] ISPRA (Istituto Superiore per la Protezione e la Ricerca Ambientale), “Sistema InfoAria SINAnet,” 2018. [Online]. Available: http://www.webinfoaria.sinanet.isprambiente.it/.
[23] B. Guo, Z. Yu, and X. Zhou, “A Data-Centric Framework for Cyber-Physical Social Systems,” IT Prof., vol. 17, pp. 4–7, 2015.
[24] J. Jin, J. Gubbi, S. Marusic, and M. Palaniswami, “An information framework for creating a smart city through internet of things,” IEEE Internet Things J., vol. 1, no. 2, pp. 112–121, 2014.
[25] X. Du, O. Emebo, A. Varde, N. Tandon, S. N. Chowdhury, and G. Weikum, “Air quality assessment from social media and structured data: Pollutants and health impacts in urban planning,” in 2016 IEEE 32nd International Conference on Data Engineering Workshops, ICDEW 2016, 2016, pp. 54–59.
[26] S. Kuznetsov, “Authoring urban landscapes with air quality sensors,” in Proceedings of Sigchi Conf. on Human Factors in Computing Systems, 2011, pp. 2375–2384.
[27] L. Sanchez et al., “SmartSantander: IoT experimentation over a smart city testbed,” Comput. Networks, vol. 61, pp. 217–238, 2014.
[28] C. Batini, A. Rula, M. Scannapieco, and G. Viscusi, “From data quality to big data quality,” J. Database Manag., vol. 26, no. 1, pp. 60–82, 2015.
[29] M. Ehling, “Harmonising Data in Official Statistics,” in Advances in Cross-National Comparison, Boston, MA: Springer US, 2003, pp. 17–31.
[30] M. Golfarelli and S. Rizzi, Data Warehouse Design, Modern Principles and Methodologies, 1st ed. McGraw-Hill, 2009.
[31] Hitachi Group, “Pentaho Community Edition (CE): Data Integration, Business Analytics and Big Data,” 2016. [Online]. Available: http://www.pentaho.com. [Accessed: 01-Nov-2016].
[32] QlikTech International AB, “Qlik Sense - Data Analytics Platform,” 2019. [Online]. Available: https://www.qlik.com/us/products/qlik-sense.