=Paper= {{Paper |id=Vol-2530/paper6 |storemode=property |title=Challenges for Air Pollution Monitoring: A Cyber-Physical Social Systems Approach |pdfUrl=https://ceur-ws.org/Vol-2530/paper6.pdf |volume=Vol-2530 |authors=Marco Zappatore,Sergio Refolo,Antonella Longo |dblpUrl=https://dblp.org/rec/conf/iot/ZappatoreRL19 }} ==Challenges for Air Pollution Monitoring: A Cyber-Physical Social Systems Approach== https://ceur-ws.org/Vol-2530/paper6.pdf
                           Challenges for Air Pollution Monitoring:
                          a Cyber-Physical Social Systems Approach
Marco Zappatore
Hesplora s.r.l. & University of Salento, Lecce, Italy
marco.zappatore@hesplora.it, marcosalvatore.zappatore@unisalento.it

Sergio Refolo
Dept. of Innovation Engineering, University of Salento, Lecce, Italy
sergio.refolo@studenti.unisalento.it

Antonella Longo
Dept. of Innovation Engineering, University of Salento, Lecce, Italy
antonella.longo@unisalento.it


1st Workshop on Cyber-Physical Social Systems (CPSS2019). October 22, 2019, Bilbao, Spain.
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Air pollution control plays a pivotal role today in urban contexts, as both citizens and public
administrators are increasingly sensitive about it. Traditional air pollution sensing is performed
and managed by public institutions with professional and expensive equipment, thus exhibiting a
series of inherent limitations such as isolated monitoring campaigns, data heterogeneity,
inconsistency and incompleteness, and limited access to sensed data. Cyber-Physical-Social Systems
(CPSS) promise to be a considerable step forward, as they promote the systematic involvement of
citizens in monitoring processes and the provisioning of proactive services to end users. However,
several elements hinder such a model. In this paper, we discuss the challenges of applying the
cyber-physical paradigm to air pollution monitoring in smart cities, exemplifying the issues on the
Italian case study; we then show how CPSS can overcome them and outline novel research directions.

KEYWORDS
Cyber-Physical Social Systems, Air Monitoring, Data Processing, Data Visualization, Mobile Crowd
Sensing.

1    Introduction
Air quality monitoring is a strategic and long-term activity that gives experts the opportunity to
make evaluations about air pollution, to study emission causes and sources as well as to develop
corrective or mitigation plans. The air quality status raises several concerns amongst experts as
well as citizens due to its related health risks [1], [2]. Therefore, it is evident that air
pollution control is necessary to prevent human diseases and to protect ecosystems. That is why it
must be addressed by local authorities and policy makers, and why it should also be a
responsibility of the stakeholders in the industrial sector. Similarly, citizens must be made aware
of the air quality status and of how to be more actively involved in actions aimed at improving
daily life quality conditions.
The attention to environmental issues is continuously increasing and it involves more and more
people: this results in a rising number of in-domain research studies. However, people can hardly
draw any conclusion on the topic by themselves and the correct interpretation of research findings
is usually troublesome for non-professional recipients [3]. In the most common scenario (usually
known as institutional monitoring), professional and expensive sensors are placed in close
proximity to a few significant areas (e.g., airports, hospitals, congested roads, etc.) by
authorized agencies or public bodies devoted to environmental control on a national, regional or
even smaller scale. Raw data are collected and published online by the same agencies. This approach
is adopted worldwide and falls under the definition of air quality assessment [4]–[8]. Published
data come directly from sensors (i.e., raw data) or from simple data manipulation processes and
usually no inferred knowledge is provided in a simple and effective way, especially when data
sources and data formats differ significantly. It is, therefore, widely accepted that dedicated
data processing solutions are needed in order to clean data from unwanted noise, thus focusing on
what really matters [9], [10]. This implies the need for monitoring outcomes to be effectively
presented to final users, in order to provide meaningful insights to the different involved actors,
as citizens’ needs differ from those exhibited by city administrators.
Cyber-Physical-Social Systems (CPSSs) promise to be a valuable solution for urban monitoring
scenarios as they leverage the availability of scores of heterogeneous sensors whose readings are
collected, aggregated and analyzed by cyber processes and profitably merged with real-time,
city-related data provided and shared by complementary social sources in order to be presented as
relevant information to citizens and authorities [11]. However, this paradigm is far from being
applied on a large scale. In this paper we will focus on the Italian scenario, by examining the
existing solution and by proposing a first step towards the adoption of CPSSs. Currently, the
Italian situation features traditional air pollution assessments based on local sensing stations
that, even if reliable and properly manned, do not guarantee a wide coverage of monitoring
campaigns (due to high costs and lack of skilled personnel) and expose several data heterogeneity
issues, thus

making any data comparison and aggregation on a wider scale difficult.
For this reason, in this paper we will thoroughly examine air pollution monitoring data provided by
Italian regional agencies for environmental protection. A proper data model will be devised in
order to aggregate data coherently. Data manipulation pipelines will be applied to collected data
in order to aggregate and visualize them properly with the help of business intelligence tools.
This procedure highlights data incompleteness and heterogeneity coming from institutional sources.
To partly overcome the issue, we propose a CPSS that is currently under development in the
framework of an Italian regional research project, aimed at large-scale, low-cost urban
environmental pollution monitoring.
The paper is structured as follows: Section 2 introduces the domain of our investigation and the
corresponding research questions. Section 3 deals with Cyber-Physical-Social Systems. Section 4
describes the addressed scenario in detail, while Section 5 shows our data analysis approach.
Achieved results are discussed in Section 6, along with the proposed CPSS modelling. Finally,
Section 7 draws conclusions.

2     Air Pollution: the Current Scenario

2.1       Legislation in Europe
Air pollution is an important environmental and societal issue that impacts human health,
ecosystems and climate change. Several official reports have addressed this topic so far, trying
to propose regulations to be applied on a large scale. For instance, the 2016 report on air quality
in Europe [12] focuses on the scenario in the EU Member States. It shows that a large portion of
the European population (as well as the ecosystems in the same region) is exposed to air pollution
levels that exceed European standards and World Health Organization (WHO) Air Quality Guidelines
(AQGs).
The most significant origin of air pollutants is represented by anthropogenic sources. They
encompass transportation systems, industry, power plants, agricultural machinery and household
appliances.
Regardless of their origin, air pollutants can be divided into two main categories: primary and
secondary ones. Primary pollutants are directly released into the environment from the processes
that generate them. The main pollutants belonging to this class (e.g. CO, NOx, SOx) are the result
of combustion processes. Secondary pollutants derive from primary ones, and are obtained

   Table 1: European legislation about emissions

Policies                                          Pollutants*
                                                  PM   O3   NO2, NOx, NH3   SO2, SOx   CO   Heavy metals                                     BaP / PAH   VOCs
Directives regulating ambient air quality
  2008/50/EC (EU, 2008)                           PM   O3   NO2, NOx        SO2        CO   Pb                                               -           Benzene
  2004/107/EC (EU, 2004)                          -    -    -               -          -    As, Cd, Hg, Ni                                   BaP         -
Directives regulating emissions of air pollutants
  (EU) 2015/2193 (EU, 2015)                       PM   -    NOx             SO2        -    -                                                -           -
  2001/81/EC (EU, 2001)                           -    -    NOx, NH3        SO2        -    -                                                -           NMVOC
  2010/75/EU (EU, 2010a)                          PM   -    NOx, NH3        SO2        CO   Cd, Tl, Hg, Sb, As, Pb, Cr, Co, Cu, Mn, Ni, V    -           VOC
  European standards on road vehicle emissions    PM   -    NOx             -          CO   -                                                -           VOC, NMVOC
  2012/46/EU (EU, 2012)                           PM   -    NOx             -          CO   -                                                -           HC
  94/63/EC (EU, 1994)                             -    -    -               -          -    -                                                -           VOC
  2009/126/EC (EU, 2009c)                         -    -    -               -          -    -                                                -           VOC

   *Pollutants: PM: Particulate matter; O3: Ozone; NO2: Nitrogen dioxide; NOx: Nitrogen oxides; NH3: Ammonia; SO2: Sulphur dioxide; SOx: Sulphur
   oxides; CO: Carbon monoxide; BaP: Benzo[a]pyrene; PAH: Polycyclic Aromatic Hydrocarbon; VOC: Volatile Organic
   Compound; NMVOC: Non-Methane VOC; HC: Hydrocarbons; Pb: Lead; As: Arsenic; Cd: Cadmium; Co: Cobalt; Cr: Chromium; Cu: Copper; Hg:
   Mercury; Mn: Manganese; Ni: Nickel; Sb: Antimony; Tl: Thallium; V: Vanadium. A dash (-) marks a pollutant group not covered by the policy.
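
As a rough illustration of how the coverage reported in Table 1 could be queried programmatically, the sketch below encodes the policy-to-pollutant mapping as a plain Python dictionary and looks up which policies list a given pollutant. The names `TABLE_1` and `directives_for` are hypothetical and introduced only for this example; the entries are transcribed from Table 1, and pollutant labels are matched as literal strings (so `VOC` and `NMVOC` are treated as distinct labels).

```python
# Policy -> set of pollutant labels, transcribed from Table 1.
# TABLE_1 and directives_for are illustrative names, not part of the paper.
TABLE_1 = {
    "2008/50/EC": {"PM", "O3", "NO2", "NOx", "SO2", "CO", "Pb", "Benzene"},
    "2004/107/EC": {"As", "Cd", "Hg", "Ni", "BaP"},
    "(EU) 2015/2193": {"PM", "NOx", "SO2"},
    "2001/81/EC": {"NOx", "NH3", "SO2", "NMVOC"},
    "2010/75/EU": {
        "PM", "NOx", "NH3", "SO2", "CO", "VOC",
        "Cd", "Tl", "Hg", "Sb", "As", "Pb", "Cr", "Co", "Cu", "Mn", "Ni", "V",
    },
    "European standards on road vehicle emissions": {"PM", "NOx", "CO", "VOC", "NMVOC"},
    "2012/46/EU": {"PM", "NOx", "CO", "HC"},
    "94/63/EC": {"VOC"},
    "2009/126/EC": {"VOC"},
}


def directives_for(pollutant):
    """Return the Table 1 policies that list the given pollutant label, in table order."""
    return [name for name, pollutants in TABLE_1.items() if pollutant in pollutants]


if __name__ == "__main__":
    # Print the policies covering a few sample pollutants.
    for label in ("NH3", "Benzene", "Hg"):
        print(label, "->", directives_for(label))
```

A dictionary of sets keeps the lookup simple and preserves the row order of the table; a real system would more likely store this mapping in a reference table of the data model.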


from their transformation due to reactions usually involving oxygen and light: oxidation is
therefore a phenomenon strictly correlated to this category of pollutants.
More specifically, PM (particulate matter), BaP (benzo[a]pyrene) and mercury (Hg) emissions come
from the incomplete combustion of various fuels, while emissions of ammonia (NH3) and methane (CH4)
come from agriculture. The current trend for PM foresees threshold exceedances even in 2020: PM
with a diameter of 10 µm or less (henceforth, PM10) exceeds the EU limit value in 21 of the 28 EU
Member States, while PM2.5 (i.e., particles whose diameter is 2.5 µm or less) exceeds it on average
in 4 states [13].
The transport and industry sectors have considerably reduced their emissions of air pollutants in
Europe since 2000 (except for BaP and cadmium (Cd) emissions in transport, and CH4 and BaP in
industry). The trend of commercial, institutional and household emissions is less positive, with a
3% increase in BaP from 2000 to 2014. Moreover, less significant reductions of air pollutants have
been experienced in agriculture.
Table 1 reports the most relevant European directives concerning air pollution.
The main goal of monitoring campaigns is to provide indicators that define emission trends; the
following list collects the main indicators used in national monitoring campaigns with the related
reference directives [14], [15]:
1. Greenhouse gases (CO2, CH4, N2O) – Framework Convention on climate change (1992) ratified with
      L 65 of 15/01/94; Kyoto Protocol (1997) ratified with L 120 of 01/06/02; CIPE resolution
      19/12/02; D.Lgs. 51/08; D.Lgs. n. 30 13/03/13
2. Acidifying substances (SOx, NOx, NH3) – Gothenburg Protocol (1999); NEC (2001/81/CE) directive;
      D.Lgs. 171/04
3. Particulate – LCP 2001/80/CE directive; CE 715/2007 regulation; CE 595/2009 regulation
4. Carbon monoxide (CO) – D.Lgs. n. 152 of 03/04/2006; 97/68/CE directive; 98/77/CE directive
5. Benzene (C6H6) – L 413 of 04/11/97
6. Persistent organic pollutants (PAH) – Aarhus Protocol (1998); L 125/06
7. Heavy metals – Aarhus Protocol (1998)

Humans can be adversely affected by exposure to air pollutants in ambient air. In response, the
European Union has produced an extensive body of legislation which establishes health-based
standards and objectives for several air pollutants. These objectives are defined over different
periods because pollutants impact human health in different ways according to exposure time (we
refer the interested reader to the existing-legislation section related to air quality in the EC
Web portal [16]).

2.2 Legislation and Environmental Control Agencies in the Italian scenario
In this paper our analysis is focused on the Italian situation: the 2008/50/CE directive,
implemented in Italy with the legislative decree D.Lgs. 155/2010, defines how to evaluate and
manage air quality for the protection of human health and the environment.
In Table 2 we summarize the currently enforced D.Lgs. 155/2010: for each pollutant, it presents the
limit concentration, the reference averaging period, the legal nature of the norm (target value or
limit value) with its enforcement date, and the permitted exceedances per year.
In Italy air quality monitoring is decentralized and performed autonomously by regional or local
agencies for environmental protection: each agency deals only with its own territory.
These agencies, named ARPA (the Italian acronym for Regional Agency for Environmental Protection),
are public institutions that provide technical support to Italian regional administrations (except
for Trentino-Alto Adige, which has been split into the two autonomous provinces of Trento and
Bolzano) to perform environmental control and enforce regulations.
These agencies, established in 1993 and nationally coordinated by SNPA (the Italian acronym for
National System for Environmental Protection), are dedicated nationwide to yearly environmental
quality assessments. On the one hand, the decentralization into local agencies implies detailed
control over a relatively limited portion of the national territory. On the other hand, however,
this causes heterogeneity across the different regions due to the lack of a shared data format and
of shared collection, management and publication policies. As a consequence, even if the agencies
apply the same environmental control methodologies and comply with the same regulations, citizens
experience different air-pollution-related monitoring services and tools depending on the agency
they refer to. Moreover, different regions present different levels of detail in the information
offered by their environmental agencies and this makes it difficult to directly compare data coming
from different locations.
This situation does not facilitate the analysis of the overall Italian pollution scenario. Indeed,
it is not possible to carry out this task properly without the technical knowledge needed to
overcome the issues briefly sketched above. The support of a software application capable of
normalizing and integrating different sources is, at present, fundamental in order to make the huge
amount of available data merged from several monitoring agencies readable and understandable. This,
in addition to the possible presence of supporting and complementary data sources provided by
citizens, would be the ideal scenario for the implementation of the CPSS paradigm. However, such a
scenario is still far off.

Table 2: D.Lgs. 155/2010

P*        C* [µg/m³]    Tavg*    TVED*/LVED*                       AE*
PM2.5     25            1Y⁑      TVED: 1.1.2010; LVED: 1.1.2015    n/a
SO2       350           1h       LVED: 1.1.2005                    24
SO2       125           24h      LVED: 1.1.2005                    3
NO2       200           1h       LVED: 1.1.2010                    18
NO2       40            1Y       LVED: 1.1.2010                    n/a
PM10      50            24h      LVED: 1.1.2005                    35
PM10      40            1Y       LVED: 1.1.2005                    n/a
Pb        0.5           1Y       LVED: 1.1.2005⁂                   n/a
CO        10 mg/m³      Max 8h   LVED: 1.1.2005                    n/a
Benzene   5             1Y       LVED: 1.1.2010                    n/a
Ozone     120           Max 8h   TVED: 1.1.2010                    25d/3Y

*P: Pollutant name; C: Pollutant concentration; Tavg: Averaging period; TVED: Target Value
Enforcement Date; LVED: Limit Value Enforcement Date; AE: Permitted exceedances each year.
⁑: Y: Year; h: Hour; d: Day; Max 8h: Maximum daily 8 hour mean.
⁂: or 1.1.2010 in the immediate vicinity of specific, notified industrial sources; 1.0 µg/m³ limit
value applied from 1.1.2005 to 31.12.2009.

2.3 Monitoring Networks and Data Availability
Air monitoring is a long-term activity and it necessarily requires careful studies. Usually, a
monitoring network (i.e., a set of monitoring stations positioned in places of interest which
provides some measures) is required. Monitoring stations record data about pollutant concentrations
in the lower atmosphere: through specific tools they perform measurements summarized in indicators,
which are useful to make comparisons with limit values defined by directives and to know whether
the situation is safe or not.
In [17] the EU scenario in terms of air quality monitoring is reported: monitoring campaigns are
usually performed all year long with urban/local or regional scope. Monitoring stations are
categorized into traffic, urban industrial or rural industrial locations. While there is a
substantial homogeneity in these aspects amongst EU countries, data availability and data reporting
differ significantly amongst Member States. As for data availability, the following categories can
be identified: 1) validated data available for authorities only; 2) validated data available for
the public after a time delay (normally 1 day for data validation procedures); 3) non-validated
data available for the public in real time or near-real time. Data reporting also varies: in some
countries it is not performed on a nationwide scale, while in others annual reports are published
by environment control agencies.
However, data are sometimes incomplete and uncertain. For instance, 15 EU Member States reported
uncertainty in their emission estimations and, in 2014, nearly 33% of data was incomplete [18],
[19]. In this context, therefore, proper data cleaning and management operations become essential
in order to make data usable and to minimize errors [20]. As a consequence, existing approaches to
air pollution monitoring rely significantly on big data and data mining solutions.
Several actions are underway in order to cope with this scenario, such as the Copernicus Atmosphere
Monitoring Service (CAMS), implemented by the European Centre for Medium-Range Weather Forecasts
(ECMWF) [21], aimed at reducing air pollution effects and the concentration of toxic breathable
elements.
In Italy, monitoring campaigns are performed in sensitive locations (e.g., high-density traffic
hotspots, airports, schools, downtown areas, industrial sites, etc.) by positioning fixed
monitoring stations for long time periods (at least 6 months). These stations are sometimes
relocated to other sites, due to their limited number. Large amounts of collected raw data are made
openly available as daily or annual datasets in (semi-structured) formats such as .csv, .xls(x) or
.json.
Data heterogeneities affect the Italian scenario as well: regional environment control agencies do
not share a common data publication format and do not comply with a unified template for publishing
data. Each agency publishes validated data on a daily/weekly basis on its own Web portal but adopts
different data visualization strategies and offers a variable set of tools for data manipulation,
ranging from simple data filtering to customized chart composition. Data granularity is
inconsistent as well, as in some cases users can access single-day datasets while larger datasets
are available in other cases, thus determining critical gaps in user experience.
The lack of a common standard hinders the chance of joint analysis: inconsistency between data
formats, data structures or detection metrics affects research potential and prevents
non-professionals from acquiring environmental awareness.
However, as pointed out throughout the text, the most significant issue affecting the Italian
scenario is the absence of a unified institutional platform allowing users to access, navigate and
manage monitoring data on a national scale.
From a legislative perspective, a federal council of Italian regional environment control agencies
was established in 2016 and a national air information system (SINAnet) [22] has been set up.
However, the council only promotes administrative cooperation amongst agencies and the national
information system is not open to the public yet. Indeed, at the time of writing, the system is
accessible only by authorized personnel from regional agencies (i.e., ARPAs).

3   CPSSs for environmental monitoring
Cyber-Physical Social Systems (CPSS) are rooted in Cyber-Physical Systems (CPS) and Cyber-Social
Systems (CSS) [11]. Therefore, CPSS are made up of multiple layers of sensors and actuators capable
of monitoring physical phenomena and people’s actions and of cyber components capable of receiving
sensor data and generating digital representations of the monitored world (i.e. the digital twins),
so that specific actions can be implemented accordingly. Sensing layers are usually populated by
IoT (Internet of Things) sensors, mobile devices, and WSNs (Wireless Sensor Networks) that provide
time-referenced and geo-referenced datasets. In addition to them, social data streams are managed
as well. Therefore, CPSS represent an evolution of IoT applications and are based on the
integration of physical, cyber and social spaces, so that new knowledge can be inferred and the
interactions with humans can happen more easily. The core idea is that heterogeneous data sources
from the physical world are fed to data processing and analytics processes, thus enabling further
data fusion procedures whose output can be used by end-user applications, as described in the
so-called data-oriented CPSS functional architectural model [23], where a CPSS solution for a
urban scenario is described as a set of “data sourcing, collection         have developed a solution for collecting, managing and
and analysis mechanisms in order to obtain city intelligence”.             visualizing data. We have analyzed a 5-year range (from 2013 to
More specifically, in [23], the authors consider a CPSS as built on        2017) by referring to standardized pollutants only (i.e., C6H6,
top of three core elements. The first one is represented by                CO, NO2, O3, SO2, PM10, PM2.5). The following subsection
collaborative sensing sources, operating according to multiple             will deal with the dataset.
sensing paradigms but sensing the same physical contexts. This
element, therefore, not only consists of traditional WSNs and IoT          4.1 Referred Dataset
nodes but also of “smartphone-carrying citizens” who become                Initially, all Italian regions were considered for the analysis: this
“valuable sensing resources”. The second core element is given             allowed us to sketch the overall scenario and to identify
by data analysis tools, needed in order to highlight any existing          differences in the way regions perform the same task. The very
spatial/temporal or content-related pattern (or correlation)               first aspect is that, despite the availability of the federal council
amongst datasets from different sources in order to increase               and of the SINAnet platform (see Section 2.3), the accessibility and
context awareness. The third element is provided by cross-spatial          availability of monitoring data vary depending on the region, thus
data fusion tools, which are in charge of mining collected                 making troublesome to perform analyses and comparisons on a
multimodal datasets and cope with heterogeneous measurement                national scale. We used data available online via the regional
scales, combination of quantitative variables and qualitative              ARPA portals.
classifications, etc.                                                      For this reason, data integration is crucial, in order to merge files
Several CPSS solutions based on this model have been proposed              from different sources and define a shared and common data
in the recent years, addressing a wide range of applications. The          format.
studies that specifically tackled urban environmental monitoring           At the starting point the count of overall data spanned across a
can be clustered depending on the targeted application. For                time window from 2010 to 2018 and amounted nearly 71M
instance, the urban noise mapping problem has been addressed in            records. The overall dataset exhibited a significant heterogeneity
[24] by adopting a fixed and mobile sensing infrastructure,                in terms of data granularity, format and structure. Therefore, for
enriched via participatory sensors by users, but no data fusion            the sake of this cases, we selected a subset of sources in order to
solutions have been proposed. The air quality assessment has been          skim raw data before cleaning and to keep only the most
analyzed in [25], considering social data sources only (as the             homogeneous ones. This decision consisted in selecting only
adopted CPSS infrastructure was fed by tweets from citizens                those regions that provide records referred to: 1) the five-year
about perceived air pollution levels), and in [26], distributing           period 2013-2017 (because other years had less available data): 2)
sensors only across communities of people, rather than to a large          fundamentals pollutants (i.e., the standardized ones: C6H6, CO,
portion of citizens. Other CPSS approaches have been applied in            NO2, O3, SO2, PM10, PM2.5). Moreover our analysis evaluates
Santander, Spain [27], where large IoT networks were deployed              only regions whose measurements have been collected in
for environmental participatory sensing and car parking                    compliance with regulatory limits. For instance, in the case of
management, but no advanced data processing and data fusion                Lazio region, data were available in annual metrics, while
solutions were proposed.                                                   pollutant metrics must be computed daily, according to
In the following sections, we will talk about the case of Italy,           regulations. This aspect poses a severe incompatibility among
which has allowed us to identify the most significant challenges in        sources having different record granularity.
managing environmental monitoring data on national scale                   Such a preliminary record filtering operation has reduced
hindering the adoption of a CPSS approach and, subsequently, we            significantly, the size of the initial dataset, by moving from 71M
will introduce a proposal for a CPSS platform dedicated to urban           to 32M records. Selected regions are nine out of twenty:
pollution control.                                                         Basilicata, Campania, Emilia-Romagna, Lombardia, Marche,
                                                                           Puglia, Sicilia, Toscana, Valle D’Aosta.

4    Case Study                                                            4.2 Data quality issues
In order to identify current challenges in environmental                   According to the definition of data quality dimension clusters
monitoring in Italy, as introduced in Section 1, we have defined a         depicted in [28], the main issues faced in managing data coming
nationwide case study about air pollution and, subsequently, we            from regional agencies have been:






     •     Completeness: data format heterogeneity and sparsity in terms of time (i.e., regional
           datasets cover different time periods) and types (i.e., regional datasets refer to
           different subsets of pollutants). Therefore, we have been forced to focus on the
           2013-2017 time period and on 7 standardized pollutants only. Moreover, records outside
           the considered time range or not referring to the considered subset of standardized
           pollutants were discarded, totally or in part. For instance, regional datasets like
           those from the Apulia and Lombardy regions were skimmed in order to keep only those
           subsets of data that complied with the referenced intervals. A special concern was
           related to CO: for this pollutant, the majority of the analyzed regions have used an
           hourly average metric, while the regulation requires an 8-hour moving average instead.
           It has been decided to use for this pollutant the metric best supported by the
           datasets, namely the hourly average. Therefore, CO records with an 8-hour moving
           average metric were removed through proper filtering.
     •     Redundancy: some datasets included multiple versions of the same type of record (i.e.,
           the same measurement with different types of metrics). Therefore, we had to select
           only one metric per record set (according to the corresponding regulation) in order to
           cope with data redundancy. Datasets from Campania, Sicily and Valle D’Aosta were the
           ones most affected by data redundancy.
     •     Accessibility (and the corresponding access time): some regions provide datasets
           through the website of their environmental control agencies, while other regions
           provide data via open data portals. In addition, for regions like Abruzzo, Liguria or
           Sardinia, it is very hard to compose a 1-year dataset that refers to all monitored
           pollutants. Indeed, it is possible either to download records on a daily basis,
           referring to all the pollutants, or to download annual datasets referring to a single
           pollutant at a time. This requires downloading scores of different files manually. In
           some cases datasets were not directly accessible and data owners (i.e., the
           corresponding environment control agency) were contacted, with no official answer.

5    The Analysis Platform
Because of the issues presented above, non-professional users are prevented from extracting
meaningful insights from such hardly comparable [29] data without technical help. To make these
data effective, researchers need platforms that support the analysis without demanding excessive
time and computational resources for data preparation, and non-professional users must be
supported in accessing data and then guided across data visualization options, as publicly
accessible and easy-to-understand data on environmental pollution can significantly improve
environmental awareness across the population. For such reasons, we have designed and
implemented a platform capable of merging heterogeneous data about air pollution, cleaning them
and visualizing them in a meaningful and effective way, by using a few dashboards.

5.1 Data model
A specific data model is behind the tool, so that data processing and visualization tasks can be
performed rigorously and coherently. We refer to the Dimensional Fact Model (DFM), which has
been proposed by Golfarelli et al. [30] specifically to support data mart design. This
conceptual representation consists of a set of fact schemata that model the analyzed domain in
terms of facts (i.e., any concept describing a time-evolving entity relevant to decision-making
processes), dimensions (i.e., any qualitative description of a fact, composed of dimensional
attributes), measures (i.e., any numerical property or calculation about a fact) and hierarchies
(i.e., any directed tree made up of dimensional attributes).
In our case study, the measurement fact is chosen as the most significant one (Figure 1). The
combination of three dimensions (time, location and pollutant type) results in multiple
potential views of the same fact, so that it can be examined from multiple perspectives. Several
measures can be associated with the defined fact (e.g., number of threshold exceedances for each
parameter, average value sensed during a given time window for a given parameter in a given
province, etc.), so that effective numerical indicators can then be derived and implemented in
visualization dashboards.

   Figure 1: Dimensional Fact Model

5.3 Data Processing
Transformations of raw data are fundamental in order to reconcile data provided by regional
environment control agencies. Technically, an ETL pipeline has been developed using Pentaho
Community Edition, an open source ETL (Extraction, Transformation and Loading) platform [31].
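As an illustration only, the reconciliation of one raw regional record into a shared format can be sketched in Python. This is not the Pentaho pipeline used in this work: the field names, the alias table, the accepted date layouts and the function names are assumptions made for the example, while the pollutant set and the 2013-2017 window come from the case study.

```python
import math
from datetime import datetime

# The seven standardized pollutants considered in the case study.
POLLUTANTS = {"C6H6", "CO", "NO2", "O3", "SO2", "PM10", "PM2.5"}

# Hypothetical spelling variants that a regional file might contain.
ALIASES = {"benzene": "C6H6", "pm 10": "PM10", "pm2,5": "PM2.5", "no 2": "NO2"}

# Hypothetical date layouts; real regional files may use others.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_pollutant(name):
    """Map a raw pollutant label to its canonical code, or None if unknown."""
    key = name.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    code = name.strip().upper()
    return code if code in POLLUTANTS else None


def parse_date(text):
    """Try each assumed date layout; return None when none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def to_shared_format(raw, region):
    """Convert one raw record (dict of strings) to the shared format.

    Returns None when the record has an unknown pollutant, an unreadable
    date, an invalid value, or falls outside the 2013-2017 window.
    """
    pollutant = normalize_pollutant(raw.get("pollutant", ""))
    day = parse_date(raw.get("date", ""))
    try:
        # Accept a decimal comma as well as a decimal point.
        value = float(raw.get("value", "").replace(",", "."))
    except ValueError:
        value = None
    if pollutant is None or day is None or value is None:
        return None
    if not math.isfinite(value) or value < 0:
        return None
    if not 2013 <= day.year <= 2017:
        return None
    return {
        "region": region,
        "station": raw.get("station", "").strip().title(),
        "date": day.isoformat(),
        "pollutant": pollutant,
        "value": value,
    }
```

Returning None for rejected records keeps filtering explicit: a caller can count the discarded rows per region before loading the remaining ones.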







   Figure 2: Data visualization – Qlik Sense (summary sheet)

Pentaho has been used for merging data sources and normalizing the corresponding datasets. Data
normalization tasks have addressed data redundancy and data inconsistency amongst data sources
and within the same source. In our case study, we specifically checked:
     1) misspellings (e.g., station name and/or address, pollutant name, unit of measurement
           name, etc.),
     2) data formats,
     3) invalid values.
After this phase, regional datasets can be integrated into a shared destination format to which
all different sources must conform.

5.4 Data Visualization
After the ETL process, the dataset size was reduced to 27.58M records from the initial 32M. This
dataset has been used as the input for data visualization. In order to achieve fast in-memory
data loading and effective visualization options, we have adopted a widely used, freely
available data analytics platform: Qlik Sense (Desktop Version) [32]. By using Qlik Sense, we
have developed a set of dashboards dedicated to the different stakeholders for the examined case
study (i.e., citizens, researchers, environment control agency personnel) with the aim




   Figure 3: Data visualization – Qlik Sense (pollutant detail sheet, NO2 case)





 Figure 4: Detail of the vertical bar chart about the average value of the given pollutant (in this case, NO2). Filters by year
 (2017) and by region (Lombardy) are applied.
of explaining, graphically and effectively, the analyses performed on cleaned datasets. These
dashboards are made up of several charts and filters. According to the Qlik Sense terminology,
the developed solution is defined as a Qlik Sense app, while each thematic group of charts
represents an app sheet.
Depending on the user role, different charts and views can be accessed. Overall, the developed
Qlik Sense app consists of 8 sheets: the first one summarizes core details, while each of the
remaining sheets represents a specific set of analyses (according to the DFM presented in
Section 5.1) on one of the considered pollutants. Proper time-based and location-based filters
have been implemented as well.
Let us now examine in more detail the sheets composing the app.
The first app sheet (represented in Figure 2) is a summary view of all the processed records,
in order to count them depending on various criteria.
This sheet is customizable thanks to several filters placed on the left side that allow the user
to refine the visualization by time period (by year, by month) and by location (by region, by
province). The pie chart on the left shows the overall distribution of detected pollutant types.
For instance, it can be seen that NO2 accounts for 39.3% of the available sensor readings (i.e.,
nearly 10.84M). The map on the right shows the number of records per province according to a
gradient color scale ranging from blue (fewer values) to dark red (more values). As can be seen,
the province of Milan has the largest number of records for the referred 5-year time period (it
is worth pointing out that Figure 2 shows the overall analysis with no filters applied).
The lower part of the sheet hosts, going from left to right, a horizontal bar chart where
monitoring stations per province are counted (also in this case, the province of Milan has 39
monitoring stations, which is the highest number on a per-province basis), a line chart where
the daily amount of recordings is reported and an overall counter of the available data points
in the referred dataset.
Each of the following sheets refers to a different pollutant. Figure 3 reports the one
associated with NO2. These sheets are aimed at underlining relationships between detected values
and regulated thresholds, in order to identify potential sources of concern. Each sheet is
formatted as specified below.
In the top left corner, two filters (by region, by year) are available. The speedometer on the
right (i.e., the gauge-like chart) allows comparing the average detected value of the given
pollutant type against its corresponding regulatory threshold (values beyond the limit are
highlighted in red). The limit value is identified by a red line and explicitly mentioned in the
footnote of the chart.
Moving towards the right, in the top section of the sheet, a line chart depicts the average
value detected per province on a daily basis, with an explicit indication of the threshold
exceedances. The chart is aimed at emphasizing existing differences amongst provinces. It
expresses its maximum potential when a single region is selected via the dedicated filter on the
left, so that all the provinces in the same region can be compared, while with no region
selected it may be slightly chaotic.
In the top right corner, a map is available, where all Italian provinces are outlined. A
gradient color scale is used for depicting average values per province, ranging from light brown
(lower values) to dark brown (higher values). This type of chart is very useful for a direct and
effective comparison of values across different areas.
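The per-province daily averages and threshold indications shown in these charts can be approximated with a few lines of Python. This is a minimal sketch under stated assumptions, not the Qlik Sense expressions used in the app: the function names and the sample readings are invented for illustration, and the 50 µg/m3 daily limit for PM10 is the EU limit value (Directive 2008/50/EC), which the text does not state explicitly.

```python
from collections import defaultdict
from statistics import mean

# Assumption: EU daily limit value for PM10 (Directive 2008/50/EC), in ug/m3.
PM10_DAILY_LIMIT = 50.0


def daily_province_averages(records):
    """Average the readings per (province, date).

    `records` is an iterable of (province, date, value) tuples.
    """
    groups = defaultdict(list)
    for province, day, value in records:
        groups[(province, day)].append(value)
    return {key: mean(values) for key, values in groups.items()}


def flag_exceedances(averages, limit):
    """Mark each (province, date) average as True when it exceeds the limit."""
    return {key: avg > limit for key, avg in averages.items()}


def exceedances_per_province(flags):
    """Count the flagged days for each province."""
    counts = defaultdict(int)
    for (province, _day), exceeded in flags.items():
        counts[province] += int(exceeded)
    return dict(counts)


# Invented sample readings (not taken from the dataset).
sample = [
    ("Milano", "2017-01-10", 62.0),
    ("Milano", "2017-01-10", 58.0),
    ("Milano", "2017-01-11", 40.0),
    ("Cremona", "2017-01-10", 49.0),
]
averages = daily_province_averages(sample)
flags = flag_exceedances(averages, PM10_DAILY_LIMIT)
print(exceedances_per_province(flags))  # {'Milano': 1, 'Cremona': 0}
```

Keeping the averaging and the flagging as separate steps mirrors the dashboards, where the same daily averages feed the gauge, the line chart and the map, while the limit is applied only where exceedances are highlighted.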



In the bottom left corner, a counter reports the number of readings for the pollutant under
examination (i.e., the one the sheet is associated with) depending on the filtering options.
Proceeding towards the right, a vertical bar chart compares, on a daily basis, the average or
maximum value (depending on the specific pollutant) against the corresponding limit value. In
order to make the chart more effective, measurements are depicted in blue unless they exceed the
threshold (in that case they are highlighted in red). Therefore, the proportion of measurements
going beyond the threshold is immediately evident. A detail of this chart is reported in
Figure 4. As can be seen, by filtering by time and by region, average values passing the
threshold are clearly identifiable.
Finally, in the bottom right corner, a vertical bar chart is located. It is used for counting
the number of measurements exceeding the corresponding limit value per year. Since Italian
national regulations allow a given number of threshold exceedances per year per pollutant, this
chart immediately shows whether in a given year that limit has been exceeded or not.

5.5 Discussion
Previous sections have highlighted the significant comparability issues in the regional
environmental datasets.
Table 3 shows the number of available files from regional websites (related only to the years
and pollutants considered for the analysis) and the overall size of the sets of files.

Table 3: Processed files and size per region

 Region                     No. of Files         Size
 Basilicata                 1295                 135 MB
 Campania                   5                    147 MB
 Emilia-Romagna             868                  191 MB
 Lombardia                  5                    694 MB
 Marche                     10                   26 MB
 Puglia                     1                    42 MB
 Sicilia                    1                    6 MB
 Toscana                    592                  157 MB
 Valle D’Aosta              5                    5 MB
 Total                      2782                 1.4 GB

The region with the largest number of files is Basilicata (1295), while the region whose files
are the largest is Lombardy (694 MB). The last row shows that the total number of files used for
this analysis is 2782, while the total size of all these files is 1.4 GB.
As for memory consumption, the full Qlik Sense app includes 8 sheets and 65 charts (interactive
elements) for an overall memory occupancy of nearly 1.6 GB.
As for the performed data analyses, several useful insights have been achieved. The following
list points out the most relevant ones for each pollutant.
1. PM10: Lombardy and Campania are the regions with the highest average value; moreover, PM10 is
     by far the pollutant with the greatest number of threshold exceedances.
2. PM2.5: Lombardy is still the region with the highest average value, with peaks in the Milano,
     Monza-Brianza and Cremona provinces. Overall, regions from Northern Italy have a higher
     average value than southern regions. This is due to the combination of weather conditions
     and vehicle density.
3. CO: Sicily is the region with the highest number of limit exceedances, while Campania is the
     region with the highest average value.
4. NO2: Apulia and Sicily are the regions with the highest average value; in addition, the
     Barletta-Andria-Trani, Bari, Taranto, Palermo and Catania provinces have an average value
     greater than the corresponding limit value.
5. O3: southern regions have a higher average value than northern regions, with the exception of
     Valle D’Aosta, which also has a high average value. Enna and Lecce are the provinces with
     the highest average value.
6. SO2: the situation is under control, since values are well below the allowed limit. Messina
     is the province with the highest average value.
7. C6H6: the situation is similar to the one found for SO2, if not even better. The only
     province with high values is Siracusa.

6   A Proposal for a CPSS Platform: APOLLON
The challenges that have emerged so far in managing data from Italian regional environment
control agencies and the promising approach disclosed by CPSSs in this research area have
motivated the research project named APOLLON, which targets the large-scale, mobile-mediated
sensing of pollutants in urban contexts, according to the CPSS principles. More specifically,
the APOLLON Project [3] is a research initiative funded by the Apulia Region (Italy) aimed at
designing, developing and deploying a platform for urban environmental monitoring in terms of
noise and air. Several data streams are gathered from heterogeneous sources (e.g., citizen-owned
personal devices, city-managed monitoring stations, etc.). The project novelty relies on: 1)
integrating low-cost sensors deployed in urban areas; 2) involving citizens directly in
monitoring campaigns according to citizen science principles; 3) sharing monitoring outcomes
with city managers directly. One of the specific requirements of the platform is to build a
monitoring network to integrate information flows gathered from sensors with other information
sources thanks to semantic technologies and geo-referential data analysis utilities so that
useful insights and high-level correlations can be achieved in near-time.
The architecture of the APOLLON system is organized into four layers (Figure 5). The IoT layer
includes devices able to collect information on the environment (i.e., mobile and stationary
environmental sensors). The data layer is devoted to processing, integrating and storing
heterogeneous data sources (social data, sensors, climatic data, clinical data, open data,
etc.). The business layer is a central processing layer that executes the business logic


and communicates with the persistence level. Finally, the semantic Decision Support System
(sDSS) represents the interface level between the system and the end user and manages all
services related to the interaction with the user (analyses, reporting, cartography, etc.). More
specifically, the data layer is in charge of managing the acquired data according to specific
ETL (Extraction, Transformation and Loading) procedures, by exploiting typical functionalities
of Decision Support Systems and a microservice-based architecture.
The architecture briefly described so far is compliant with core CPSS principles (see
Section 3). Moreover, the platform backend features a set of components specifically dedicated
to data management and included in the so-called Hybrid Storage Layer (HSL), made up of five
elements: the Data Management, the Data Processing and the Message Management block plus a
Service Catalogue that indexes and exposes available services.
The HSL allows managing structured/semi-structured/unstructured data, as well as all storage
solutions provided in the APOLLON Data Lake (health data, open data, multidimensional data, user
profile/community of interest data, semantic data, IoT sensor data, social data and urban
geospatial data). The HSL contains relational, non-relational, multidimensional, and SFTP type
storage systems. Specifically, we consider the following storage solutions:

    •    Staging area: temporary storage area for raw data collection, to be exposed for
         processing and cleaning operations provided by the “Data Management” and “Data
         Processing” blocks;
    •    Health data storage: area for health data provided by local health authorities and the
         Ministry of Health Web portal (e.g., admissions for respiratory diseases, mortality
         data, etc.);
    •    Open data storage: area for the collection of the data streams coming from ARPA (i.e.,
         Italian regional agency for the environmental protection) junction boxes and
         meteorological stations;
    •    User profile/community of interest storage: area for registering and managing users
         involved in the project;
    •    Multidimensional data storage: area for multidimensional analyses on collected data to
         highlight any existing correlation;
    •    Semantic storage: area for ontologies and linked data;
    •    IoT sensor data storage: area for collecting the data streams coming from sensors;
    •    Social data storage: area required for the sentiment analysis phase on data coming from
         social networks;
    •    Urban Geospatial data storage: area aimed at hosting the thematic cartography related to
         pollutants and weather-climatic information.

At the moment of writing this paper, the APOLLON platform




   Figure 5: APOLLON platform (logical architecture).



deployment is currently under way and the first two pilot sites are           REFERENCES
providing the first datasets coming from citizens. Preliminary                [1]    WHO (World Health Organization), “Ambient Air Pollution: A Global
assessments have shown clearly the potential of the proposed                         Assessment of Exposure and Burden of Disease,” 2016.
                                                                              [2]    WHO (World Health Organization), “How air pollution is destroying our
CPSS-based approach, in terms of platform scalability, learning                      health,”     2019.     [Online].      Available:   https://www.who.int/air-
potential for end users, involvement of end users, engagement of                     pollution/news-and-events/how-air-pollution-is-destroying-our-health.
                                                                              [3]    The Center for Public Integrity, “Most EPA Pollution Estimates Are
policy makers and city managers, suitability to further integration                  Unreliable, So Why Is Everyone Still Using Them?,” EcoWatch, 2018.
with additional systems (such as analysis of population healthcare                   [Online]. Available: https://www.ecowatch.com/epa-emission-factors-
status). As for the integration of mobile sensed data with those                     2529636639.html.
                                                                              [4]    WHO (World Health Organization), Monitoring ambient air quality for
of official statistics, crucial aspects are the specialization of                    health impact assessment. 2002.
completeness considering the representativeness, selectivity and              [5]    European Union, “Guidance on Assessment under the EU Air Quality
                                                                                     Directives,” 2005.
sparsity aspects, the trustworthiness in the security quality                 [6]    IAQM (Institute of Air Quality Management), “A guide to the assessment
dimension and the specialization of accuracy, consistency and                        of air quality impacts on designated nature conservation sites,” London,
redundancy aspects [28].                                                             UK, 2019.
                                                                              [7]    P. Barn, P. Jackson, N. Suzuki, and T. Kosatsky, “Air Quality
A thorough analysis of the effectiveness of the proposed CPSS-                       Assessment Tools: A Guide for Public Health Practitioners,” Vancouver,
based approach will be performed in the upcoming months.                             Canada, 2011.
                                                                              [8]    EPA (Environment Protection Authority) - South Australia, “Ambient Air
                                                                                     Quality Assessment,” 2016.
                                                                              [9]    Italian Ministry for the Environment Land and Sea, “Environmental
7    Conclusions                                                                     Challenges - Summary of the State of the Environment in Italy,” Rome,
                                                                                     Italy, 2009.
In this paper, a thorough analysis of the current Italian scenario in         [10]   T. A. J. Kuhlbusch, “Challenges and the Future of Urban Air Quality
                                                                                     Monitoring in Europe,” 2014.
terms of available and comparable institutional datasets for air              [11]   J. Zeng, L. T. Yang, M. Lin, H. Ning, and J. Ma, “A survey: Cyber-
pollution monitoring has been performed. By starting from                            physical-social systems and their system-level design methodology,”
available data sources (i.e., datasets published by Italian regional                 Futur. Gener. Comput. Syst., 2016.
                                                                              [12]   EEA (European Environment Agency), “Air quality in Europe - 2018
environment control agencies), a series of shortcomings has been                     Report,” Luxembourg, 2018.
identified, ranging from data heterogeneity, inconsistency and                [13]   Down To Earth, “Air Pollution: PM Levels Continue to Exceed EU Limit
                                                                                     in Large Parts of Europe,” 2016. [Online]. Available:
incompleteness to significant limitations in accessing monitoring                    https://www.downtoearth.org.in/news/air/air-pollution-pm-levels-
data.                                                                                continue-to-exceed-eu-limit-in-large-parts-of-europe-56427.
A solution for collecting, processing, aggregating and visualizing            [14]   ISPRA, “Qualità dell’Ambiente Urbano,” Rome, Italy, 2014.
                                                                              [15]   ISPRA (Istituto Superiore per la Protezione e la Ricerca Ambientale),
air pollution datasets from a subset of Italian regions, referring to                “Annuario dei Dati Ambientali (Environmental Data Yearbook) 2018,”
a five-year time period and to a subset of standardized pollutants                   2019.
                                                                              [16]   European Commission, “Air Quality - Existing Legislation,”
has been proposed. This approach highlighted data-related                            Environment.                         [Online].                  Available:
challenges to the adoption of Cyber-Physical Social Systems                          https://ec.europa.eu/environment/air/quality/existing_leg.htm.
                                                                              [17]   EEA (European Environment Agency), “The Air Quality Monitoring
(CPSS) in this sector. Both these challenges and the analysis                        Situation in Europe: State and Trends,” 2016. [Online]. Available:
insights achievable in the visualization process have been                           https://www.eea.europa.eu/publications/92-9167-058-8/page010.html.
presented in this paper. Moreover, by starting from the elements              [18]   C. B. B. Guerreiro, V. Foltescu, and F. de Leeuw, “Air quality status and
                                                                                     trends in Europe,” Atmos. Environ., vol. 98, pp. 376–384, 2014.
identified during the design and implementation steps of the                  [19]   EEA (European Environment Agency), “The air quality monitoring
proposed solution, a regional CPSS addressing noise and air                          situation in Europe - State and trends,” 2016. [Online]. Available:
                                                                                     https://www.eea.europa.eu/publications/92-9167-058-8/page010.html.
pollution monitoring, has been devised, as the first step towards             [20]   S. Devarakonda, P. Sevusu, H. Liu, R. Liu, L. Iftode, and B. Nath, “Real-
the adoption of CPSS for environmental monitoring nationwide.                        time air quality monitoring through mobile sensing in metropolitan
The CPSS platform, named APOLLON has been described in the                           areas,” in Proceedings of the ACM SIGKDD International Conference on
                                                                                     Knowledge Discovery and Data Mining, 2013, p. 8.
final section of the paper.                                                   [21]   ECMWF, “Monitoring Air Pollution Across Europe,” 2019. [Online].
In the next near future, challenges related to the integration of big                Available:      https://atmosphere.copernicus.eu/monitoring-air-pollution-
                                                                                     across-europe.
data and official statistics will be investigated in order to properly        [22]   ISPRA (Istituto Superiore per la Protezione e la Ricerca Ambientale),
exploit the potentials of mobile crowd sensing for urban                             “Sistema      InfoAria      SINAnet,”      2018.    [Online].   Available:
environmental pollution monitoring. Main dimension clusters of                       http://www.webinfoaria.sinanet.isprambiente.it/.
                                                                              [23]   B. Guo, Z. Yu, and X. Zhou, “A Data-Centric Framework for Cyber-
data quality will be detailed and analyzed in the domain of big                      Physical Social Systems,” IT Prof., vol. 17, pp. 4–7, 2015.
data from mobile crowd sensors and approaches to effectively                  [24]   J. Jin, J. Gubbi, S. Marusic, and M. Palaniswami, “An information
                                                                                     framework for creating a smart city through internet of things,” IEEE
include people as data scientists will be described.                                 Internet Things J., vol. 1, no. 2, pp. 112–121, 2014.
                                                                              [25]   X. Du, O. Emebo, A. Varde, N. Tandon, S. N. Chowdhury, and G.
                                                                                     Weikum, “Air quality assessment from social media and structured data:
ACKNOWLEDGMENTS                                                                      Pollutants and health impacts in urban planning,” in 2016 IEEE 32nd
This work was supported in part by the research project                              International Conference on Data Engineering Workshops, ICDEW 2016,
                                                                                     2016, pp. 54–59.
“APOLLON - environmentAl POLLution aNalyzer”, within the                      [26]   S. Kuznetsov, “Authoring urban landscapes with air quality sensors,” in
“Bando INNONETWORK 2017” funded by Regione Puglia                                    Proceedings of Sigchi Conf. on Human Factors in Computing Systems,
                                                                                     2011, pp. 2375–2384.
(Italy) in the framework of the “FESR - Fondo Europeo di                      [27]   L. Sanchez et al., “SmartSantander: IoT experimentation over a smart city
Sviluppo Regionale”.                                                                 testbed,” Comput. Networks, vol. 61, pp. 217–238, 2014.
                                                                              [28]   C. Batini, A. Rula, M. Scannapieco, and G. Viscusi, “From data quality to
                                                                                     big data quality,” J. Database Manag., vol. 26, no. 1, pp. 60–82, 2015.

[29]      M. Ehling, “Harmonising Data in Official Statistics,” in Advances in
          Cross-National Comparison, Boston, MA: Springer US, 2003, pp. 17–31.
[30]      M. Golfarelli and S. Rizzi, Data Warehouse Design, Modern Principles
          and Methodologies, 1st ed. McGraw-Hill, 2009.
[31]      Hitachi Group, “Pentaho Community Edition (CE): Data Integration,
          Business Analytics and Big Data,” 2016. [Online]. Available:
          http://www.pentaho.com. [Accessed: 01-Nov-2016].
[32]      QlikTech International AB, “Qlik Sense - Data Analytics Platform,”
          2019. [Online]. Available: https://www.qlik.com/us/products/qlik-sense.



