=Paper= {{Paper |id=Vol-1222/paper5 |storemode=property |title=The extraction and fusion of meteorological and air quality information for orchestrated services |pdfUrl=https://ceur-ws.org/Vol-1222/paper5.pdf |volume=Vol-1222 |dblpUrl=https://dblp.org/rec/conf/mir/JohanssonEKWKKVK14 }} ==The extraction and fusion of meteorological and air quality information for orchestrated services== https://ceur-ws.org/Vol-1222/paper5.pdf
The Extraction and Fusion of Meteorological and Air Quality Information for Orchestrated Services

Lasse Johansson
The Finnish Meteorological Institute, Dept. of Atmospheric composition,
Erik Palmenin aukio 1, 00101, Helsinki, Finland
lasse.johansson@fmi.fi

Victor Epitropou and Kostas Karatzas
Aristotle University of Thessaloniki, Dept. of Mechanical Engineering,
54124 Thessaloniki, Greece

Leo Wanner
Catalan Institute for Research and Advanced Studies,
Dept. of Information and Communication Technologies,
Pompeu Fabra University, Barcelona, Spain

Ari Karppinen and Jaakko Kukkonen
The Finnish Meteorological Institute, Dept. of Atmospheric composition

Stefanos Vrochidis and Ioannis Kompatsiaris
Information Technologies Institute, Centre for Research and Technology Hellas,
Thessaloniki, Greece


Copyright © by the paper’s authors. Copying permitted only for private and academic purposes.
In: S. Vrochidis, K. Karatzas, A. Karppinen, A. Joly (eds.): Proceedings of the International Workshop on Environmental Multimedia Retrieval (EMR2014), Glasgow, UK, April 1, 2014, published at http://ceur-ws.org

ABSTRACT
   The PESCaDO system (Personal Environmental Service Configuration and Delivery Orchestration) aims at providing accurate and timely information about local air quality and weather conditions in Europe. The system receives environment-related queries from end users, discovers reliable environmental multimedia data on the web from different providers and processes these data in order to convert them into information and knowledge. Finally, the system uses the produced information to provide the end user with a personalized response. In this paper, we present the general architecture of the above-mentioned system, focusing on the extraction and fusion of multimedia environmental data. The main research contribution of the proposed system is a novel information fusion method based on statistical regression modelling that uses as input data land use and population density masks, the historic track record of data providers as well as an array of atmospheric measurements at various locations. An implementation of this fusion model has been successfully tested against two selected datasets on air pollutant concentrations and ambient air temperatures.

1. INTRODUCTION
   Recently, the emergence of social media, personalized web services and the increased public awareness of environmental conditions that impact the quality of life have resulted in a demand for easier access to environmental information tailored to personal requirements. In particular, in the case of the atmospheric environment, there is a need for an integrated assessment of the impact of air pollution, allergens and extreme meteorological conditions on public health [9], [8]. In addition, this information has to be disseminated to citizens in an easily accessible form [7].
   Getting a direct answer to a seemingly simple question such as “How will the air quality be tomorrow in Glasgow?” involves extensive manual search and expert interpretation of the often contradictory and heterogeneous information found on various web sites. Furthermore, a significant portion of air quality and meteorological information is published on the Internet only in the form of colour-mapped, geo-referenced images [1]. The quality of information might also vary significantly in reliability and relevance with respect to the queried location and time. On the other hand, even biased and inaccurate information about air quality could be utilized effectively by data fusion methods in order to provide reliable information. The success of fusing multiple model results is evident in the case of models with no major deviation of forecasting performance, and has been demonstrated in many related studies [22].
   In this context, an approach has been presented in [5] to provide air quality information for any location within a large geographical domain by fusing air quality data from multiple sources with a statistical air pollution model (RIO). A review of land use regression (LUR) models states that LUR models have been very successful in predicting annual mean concentrations of NO2 and PM2.5 in urban environments [4]. However, these state-of-the-art LUR models are difficult to utilize for the accurate prediction of hourly concentrations of air pollutants; a more dynamic approach is needed. Another complication is the extremely heterogeneous nature of the input data, which may contain model forecasts and observations, both with varying reliability, time of validity and location. Spatial and temporal gaps are also a matter of concern; there is only a finite number of measurement stations, and forecasting models also have a finite spatial and temporal resolution. These considerations lead to the need to use some form of data interpolation either in space or time, or both.
   In this paper, we aim to describe the general architecture of the PESCaDO system, focusing especially on the fusion of extracted information [20], [21]. First, we discover environmental nodes (i.e. web resources that include environmental measurements) which are relevant to the area of interest. Then, a specific service called AirMerge is presented, which is capable of performing extraction and fusion of information from a wide range of online Chemical Weather (CW) forecasting systems. The online fusion service is then presented; this is a general method for the fusion of



processed meteorological and air quality data, and is also the main topic of this paper. There are many definitions of data fusion, as it is a method that is applied to various scientific domains, such as remote sensing, meteorological forecasting, sensor networks, etc. [19]. We use the term “fusion” to describe the process of integration of multiple data and knowledge into a consistent, accurate, and useful representation. An evaluation of the performance of this fusion system is presented for two selected cases: (i) the fusion of atmospheric temperature forecasts and (ii) the fusion of measured NO2 concentrations.

2. FRAMEWORK
   We present here an overview of the general architecture of the PESCaDO system. For a more detailed description, the reader is referred to [20], [21].

2.1 An overview of the PESCaDO system
   The purpose of the PESCaDO system is to address the need for timely personalized environmental information (see www.pescado-project.eu for more information). It first processes user queries, based on the personal information on the user, formulated in terms of a user profile. For instance, health conditions such as asthma may affect the displayed warnings and recommendations, while the user group (e.g. citizen or administrative expert) affects the level of detail and technicality of the response.

Figure 1a-b: A simplified schematic diagram of the PESCaDO system, starting from the user defined query and ending at the delivery of response (a). An example response for the user is presented in figure (b).

   The queries are formulated in terms of PESCaDO’s Problem Description Language via an interactive web interface. First, the system discovers environmental nodes that contain measurements for the areas of interest. Then, for each query, (i) relevant environmental data sources are orchestrated, (ii) data from textual and image formats in the sources are identified, extracted, fused and reasoned over to assess the relevance of the data for the user, and (iii) the query and the outcome are presented in terms of a bulletin in the language preferred by the user.
   Figure 1a illustrates the information flow of PESCaDO from the viewpoint of the Fusion Service, which is the backbone of the system. The system includes two uncoupled process chains, here called pipelines, that operate in offline and online modes. In the offline pipeline, environmental websites that cover the region targeted by the user are searched for on the web, and data are extracted from the identified sites and fed into the database of the system. We use the term ‘offline’ here since at the time of the user query the data used by the pipeline have already been retrieved, processed and stored in a local database. In the online pipeline, user queries are processed and answered. The online pipeline starts from the specification of personal information and query by the user. With this information, the system first determines which aspects of environmental and contextual knowledge (e.g. temperature, CO2 concentration, etc.) are relevant to the user and the query (cf. Fig. 1, Relevant aspects determination). Next, the Fusion Service (FS) is given a request to produce fused information about the identified relevant aspects. At this stage, the system retrieves information from the database and starts to process it. The ‘relevant aspects’ could be, for instance, “NO2 concentration and ambient air temperature, tomorrow between 12:00 and 18:00 in a specified region in Helsinki, given the reported traffic density”. Furthermore, the user profile (administration personnel vs. citizen; healthy individual vs. allergic, etc.) affects the way the response is ultimately presented to the user (relevant aspects determination).
   The Data Retrieval Service (DRS) serves as an interface through which other PESCaDO services can retrieve information (i.e. environmental measurements) from the database. The Fusion Service queries the DRS to receive environmental data available for the requested geographic areas and time periods for all related environmental aspects. After the FS has fused the data retrieved from the DRS, the fused data are inserted into the PESCaDO Knowledge Base (KB).
   The PESCaDO KB contains, manages and provides information represented with the PESCaDO ontology to other services [13]. This KB also provides the Fusion Service with supporting information needed in the fusion process. This includes source identification and fixed coordinates if available, and source reliability. Furthermore, the PESCaDO ontology helps to translate verbal ratings into numeric form if needed. For instance, the expression “heavy rain” can be converted into a numeric value in mm/h with the help of the concept definitions in the ontology. More specifically, the KB is queried about the upper and lower limit for “heavy rain” in the specified region, and the average value of the returned limits can then be taken to represent the input in numeric format, an approach related to the use of fuzzy logic methods in air quality problems [6].
   Once all input data are in numeric form, the FS fuses the data one variable at a time (e.g. temperature, wind speed, NO2 or O3), utilizing the available uncertainty metrics for each information source given by the Uncertainty Metrics Tool (UMT). Fused data are stored in the KB, and then the remaining tasks, including the selection, structuring and presentation of the information resulting from the fusion to the user, can be carried out. In parallel, the retrieved


information, which can be used for performance evaluation later on, is passed to UMT and stored. Using this stored information, UMT evaluates measured values against forecasts autonomously and produces updated source node uncertainty metrics.

2.2 Discovery of environmental nodes
   As described in the previous section, the first step realized by the PESCaDO framework is the discovery of environmental nodes. The huge number of the nodes, their diversity both in purpose and content, as well as their widely varying and a priori unknown quality, set several challenges for the discovery and the orchestration of these services [21].
   The PESCaDO discovery framework combines the two main methodologies of domain-specific internet search: (a) the use of existing search engines for the submission of automatically generated domain-specific queries, and (b) focused crawling of predetermined websites [23]. To support domain-specific search using a general purpose search engine [12], two types of domain-specific queries are formulated: basic and extended. Basic queries are produced by combining environment-related keywords (e.g. weather, temperature) with geographical data (e.g. city names). Extended queries are generated by enhancing the basic queries with additional domain-specific keywords, which are produced using the keyword spice technique [14]. Both types of queries are then submitted to the Yahoo BOSS API search engine.
   In parallel, a focused crawler is employed, built upon the Apache Nutch crawler and based on [18]. This implementation attempts to classify sites by using hyperlink and text information (i.e. anchor text and text around the link) with the aid of a supervised classifier. This approach is new in comparison to a previously presented method for web-based information identification and retrieval with the aid of a domain vocabulary and web-crawling tools [2].
   The output of both techniques is post-processed in order to improve the precision of the results by separating relevant from irrelevant nodes and by categorizing and further filtering the relevant nodes with respect to the types of environmental data they provide (air quality, pollen, weather, etc.). The determination of the relevance of the nodes and their categorization is done using a supervised classification method based on Support Vector Machines (SVM). The SVM classifiers are trained with manually annotated websites and with textual and visual features extracted from the environmental nodes. The textual features are key phrases and concepts extracted from the metadata and content of the webpages using KX [15], and the vector representation is based on the bag of words model. The visual features (MPEG-7, [17]) are extracted from the images included in the discovered websites in order to identify heatmaps, which are usually present in air quality forecast websites.

2.3 Orchestration of environmental nodes and data extraction
   Once the environmental nodes have been detected and indexed, they are available as data sources or as active data consuming services (if they require external data and are accessible via a web service API).
   To distil data from text, advanced natural language parsing techniques are applied, while to transform semi-structured web content into structured data, regular expressions and HTML trees are used. Data extraction from images focuses on heatmap analysis using the AirMerge system, described in the following section.

2.4 AirMerge subsystem
   A significant portion of Air Quality (AQ) related information (in particular, Chemical Weather forecasts) is published on the Internet only in the form of colour-mapped geo-referenced images. Such image-based information cannot be parsed via the usual text-mining and screen-scraping techniques used in web mash-up-like services. It was thus important to provide PESCaDO with a specialized service that allows accessing and using CW forecast images as another source of data during the Orchestration and Fusion phases. Such a system, called AirMerge, has already been developed and described in [3], [1].
   AirMerge is an open access system, which is currently dedicated to the whole European continent (the coverage of different territories is possible). It accesses a large number of environmental nodes containing CW information and can automatically extract data from various data sources. The forecast images commonly have geographical spatial resolutions ranging from 1x1 km to 20x20 km, and temporal resolutions from a minimum of one hour to an entire day [10]. The reported values are usually maximum or average air pollution concentration values for the selected integration time.

Figure 2: Example of a PM2.5 forecast (produced by MACC) conversion process using AirMerge. Bitmap data (a) is transformed into numerical form by using the colour scale (c). The heatmap (a) has been reproduced in (b) using the converted numeric grid.

   In the context of PESCaDO, AirMerge, apart from performing image extraction, acts as an autonomous web-crawling, parsing and database-storage mechanism for CW forecasts, using its own means and processes, which are distinct from those of PESCaDO, having been developed independently. The harvested data cover most of Europe for a time period going back to August 2010, when the system first became operational. Time resolutions range from one hour to a day, depending on the capabilities of the sources used.
   A typical set of CW models and the resulting images can be found in the European Open-access Chemical Weather Forecasting Portal described by [1], which has been developed in



the frame of COST Action ES0602 (www.chemicalweather.eu). AirMerge is able to convert such image-based concentration maps into numerical, geographically referenced data, accounting for geographical projections, missing data, noise and the differences in publishing formats between different model providers. The result is the effective conversion of image data back into numerical data, which is then made directly available for a number of numerical processing applications.
   It should be clarified that in the proposed system AirMerge has two roles: (a) it performs image data extraction and (b) it is an additional environmental node that provides environmental data encoded in images.

3. FUSION OF EXTRACTED INFORMATION
   The fusion of information in an orchestrated service such as PESCaDO offers several advantages to the user. First, the output of the system includes only one set of values instead of an extensive collection of pieces of information that may not agree with each other. Second, the fusion result will be of better quality than the individual sources. Third, small geographic and temporal gaps in the input data can be filled in.
   The above-mentioned services for environmental node discovery and data retrieval guarantee a large amount of relevant input data, which need to be fused with respect to the user defined query. However, individual competing pieces of information from different nodes can seldom be regarded as equally relevant, and thus a general measure for information relevance and quality is needed for data fusion.
   In the fusion process, all pieces of meteorological and air quality data correspond to a certain time and place. These pieces of information can be regarded as statistical estimators $\theta_i(\mathbf{x}_i, t_i)$, or $\theta_i$ in short, in which $\mathbf{x}_i$ denotes location and $t_i$ time, for the conditions governing the area and time of interest for the user:

$$\theta_i(\mathbf{x}_i, t_i) = \theta(\mathbf{x}_0, t_0) + \varepsilon_i \tag{1}$$

where $\mathbf{x}_0$ and $\mathbf{x}_i$ are the coordinate vectors of the location of interest and of the location associated with the estimator, $t_0$ and $t_i$ are the time of interest and the estimator time, and $\varepsilon_i$ is the estimator error. For sensors the estimator time is simply the time of measurement. The algorithm that is used in calculating the fused value requires information about the statistical properties of $\varepsilon_i$, namely the expected variance of $\varepsilon_i$. Thus, a detailed description of the evaluation of $\sigma^2(\varepsilon_i)$ is given. The fusion service estimates an aggregate statistical variance measure for each $\theta_i$, and these variance measures are then used for the assignment of averaging weights to each $\theta_i(\mathbf{x}_i, t_i)$. Essentially, a large estimated aggregate variance causes the assigned weight to decrease, while the data from the more accurate and relevant sources are assigned larger weights and gain more emphasis in the fusion.

3.1 Variance estimation
   The variance of $\varepsilon_i$, $\sigma^2(\varepsilon_i)$, is affected by the information source’s capability to properly assess the phenomenon of interest. In addition, information about air pollutant concentrations and weather conditions loses accuracy rapidly as a function of the temporal interval between the measurement time and the time of interest defined by the user. Furthermore, a data point near $\mathbf{x}_0$ should always get a larger weight in the fusion than other data points that describe the conditions in more remote locations. Thus, we assume that the variance related to $\varepsilon_i$ is the sum of these three individual (independent and thus summable) components, given by

$$\sigma^2(\varepsilon_i) = \sigma_d^2(d) + \sigma_\tau^2(\tau) + \sigma_0^2(\theta_i) \tag{2}$$

where $\sigma_d^2(d)$ is the spatial variance component as a function of $d$ and $\sigma_\tau^2(\tau)$ is the temporal variance component as a function of $\tau$, in which

$$d = \lVert \mathbf{x}_0 - \mathbf{x}_i \rVert \tag{3a}$$

$$\tau = \lvert t_0 - t_i \rvert \tag{3b}$$

   $\sigma_0^2(\theta_i)$ in Eq. 2 describes the information source’s inherent quality in terms of variance, i.e., the capability to estimate $\theta(\mathbf{x}_0, t_0)$ at point-blank range, when $d$ and $\tau$ are equal to zero. For the evaluation of $\sigma_0^2(\theta_i)$, stored information about the source’s prediction accuracy in the past can be used, evaluated by the Uncertainty Metrics Tool (see Fig. 1). More specifically, measurements and model forecasts are paired together if they represent the same time and location, and the statistical variance is then calculated for the population of evaluation pairs.
   In the presented PESCaDO framework, the location for the estimator $\theta_i(\mathbf{x}_i, t_i)$ may not have been defined exactly; this is usually the case, for instance, with extracted weather forecasts for cities. In these cases $\mathbf{x}_i$ actually pinpoints the centre of the city while the information represents the conditions throughout the city. In such cases the coordinates are flagged as approximations and $d$ is set using $\lVert \mathbf{x}_0 - \mathbf{x}_i \rVert$ and $r$, where $r$ is the radius of the city (the exact expression is not recoverable from the extracted text).
   The variance models $\sigma_d^2(d)$ and $\sigma_\tau^2(\tau)$ can be formulated with statistical methods. In the fusion service these have been formulated individually for each air pollutant species using regression analysis with historical measurement data. For the pilot application of the method, these data represent 6 to 43 measurement stations across Finland, depending on the measured values. More specifically, the following simple regression models are employed:

$\sigma_d^2(d) = $ [right-hand side not recoverable from the extracted text]   (4a)
$\sigma_\tau^2(\tau) = $ [right-hand side not recoverable from the extracted text]   (4b)

where the model parameters are defined with statistical regression techniques. More complex regression models were also studied, but the added benefit of using more natural, logarithmic regression models was negligible; the achieved correlation of the polynomial $\sigma_\tau^2(\tau)$ models is generally very high for the temporal domain of interest (τ < 36 h). In the formulation of $\sigma_d^2(d)$, the measurement station’s capability to predict the measured phenomenon at a distance of $d$ (covariance of the two time series) is evaluated.

3.2 Optimal weight calculation
   Assuming all data sources to be independent and the estimators to be unbiased ($E[\varepsilon_i] = 0$), an optimal fused value $\theta_f(\mathbf{x}_0, t_0)$ can be calculated according to [16], given by:

$$\theta_f(\mathbf{x}_0, t_0) = \sum_{i=1}^{n} w_i \, \theta_i(\mathbf{x}_i, t_i) \tag{5}$$

where the individual weights $w_i$ are given by

$$w_i = \frac{\sigma^{-2}(\varepsilon_i)}{\sum_{j=1}^{n} \sigma^{-2}(\varepsilon_j)} \tag{6}$$
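
The weighting scheme of Sections 3.1 and 3.2 can be illustrated with a minimal Python sketch. It assumes that the weights of Eq. 6 take the standard inverse-variance form (each weight proportional to the reciprocal of the aggregate variance), which matches the description of weights decreasing with growing variance. The class `Estimate`, the functions `total_variance` and `fuse`, and the linear distance and time terms with coefficients `k_d` and `k_tau` are hypothetical placeholders chosen for illustration; they are not the fitted regression models of Eqs. 4a and 4b.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Estimate:
    value: float     # reported value, e.g. temperature in deg C
    base_var: float  # inherent variance of the source (historic track record)
    dist_km: float   # d: distance from the location of interest, in km
    lag_h: float     # tau: |time of interest - estimator time|, in hours


def total_variance(e: Estimate, k_d: float = 0.05, k_tau: float = 0.1) -> float:
    """Aggregate variance as a sum of three independent components (cf. Eq. 2).

    The linear terms k_d * d and k_tau * tau are hypothetical placeholders,
    not the fitted regression models of the paper.
    """
    return e.base_var + k_d * e.dist_km + k_tau * e.lag_h


def fuse(estimates: List[Estimate]) -> Tuple[float, List[float]]:
    """Return (fused value, weights) using inverse-variance weighting."""
    variances = [total_variance(e) for e in estimates]
    if not variances or any(v <= 0.0 for v in variances):
        raise ValueError("need at least one estimate, each with positive variance")
    inverse = [1.0 / v for v in variances]
    norm = sum(inverse)
    weights = [inv / norm for inv in inverse]
    fused = sum(w * e.value for w, e in zip(weights, estimates))
    return fused, weights


# A nearby, current observation and a distant, 20 h old forecast.
nearby = Estimate(value=10.0, base_var=1.0, dist_km=0.0, lag_h=0.0)    # variance 1.0
remote = Estimate(value=14.0, base_var=1.0, dist_km=20.0, lag_h=20.0)  # variance 4.0
fused, weights = fuse([nearby, remote])
print(weights, fused)  # weights close to [0.8, 0.2], fused value close to 10.8
```

In this toy case the remote, older forecast has four times the aggregate variance of the nearby observation and therefore receives a quarter of its weight, so the fused value stays close to the nearby observation.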



To assure statistical independence of $\theta_1, \ldots, \theta_n$, only the most          population evaluation radii; the best correlation was achieved
relevant estimator $\theta_i$ per data source is selected for the fused                    with the abovementioned values (land-use with a 200m radius,
value calculation in Eq. 5. If a collection of estimators                                  population density with a 6km radius). Nevertheless, this
$\{\theta_1(\mathbf{x}_1, t_1), \ldots, \theta_k(\mathbf{x}_k, t_k)\}$ is available from the same source, the    mathematically intensive regression procedure is not discussed in
selected $\theta_i$ to represent the source is simply the one with the                     this paper further although for the NO2 pollutant, a demonstration
lowest $\sigma^2(\varepsilon_i)$ from the collection. In the particular case of            of the profiling method and its capability to predict the expected
extracted time series from measurement stations, the estimator                             hourly concentration is presented in section 4.1.
which has the smallest $\tau$ is selected to represent the source, as $d$
and the base variance are the same for all $\theta_1, \ldots, \theta_k$.
   Theoretically, it can be shown that the fused value (      ) is
the optimal estimator in terms of mean squared error and that the
prediction accuracy increases while the number of independent
data sources (n) is increased [16]. More importantly,      (     )
does not suffer from low quality input data, as long as         in
Eq. 2 has been estimated reasonably well.
3.3 Bias correction
   In the algorithm presented in section 3.2, it was assumed that
each      is an unbiased estimator for the conditions in        at the
time     . Local air quality measurements from a different
environment, however, are usually significantly biased estimators
for the conditions in other nearby environments. Moreover, the                   Figure 3: Profile evaluation with land use and population
hour of day may even contribute to the bias (consider a                         density maps. The larger circle represents the area for local
measurement station near a busy road during the morning traffic).                   population determination and the smaller red circle
Thus, in order to use Eq. 5 effectively, the fusion service utilizes a         represents the area for land use determination. Satellite image
geographic profiling feature to detect and automatically remove                                  provided by Google Earth.
this kind of structural bias from the estimators. The fusion service              As discussed in at the beginning of section 2.1 the fusion
was incorporated with high-resolution land use and population                  service stores measurements as evaluation material for individual
density masks for Finland (the selected domain for the PESCaDO                 service providers and models. Thus for another completely
prototype). For land-use, a dataset from CORINE with a                         different region other than Finland, the regression parameters for
resolution of 50m x 50m is being used. For population density                  profiling can be set without a fixed set of calibration material; the
data (for 2010), the fusion service has the prototype domain                   stored measurements that have flown through the PESCaDO
covered with a resolution of 250m x 250m. These two data                       system can be further exploited by setting up the regression
sources are used for profiling and comparing the differences                   parameters for profiling automatically as the number of
between the environments in and and ultimately, (                  ) is        measurements builds up over time. In this sense the profiling
polished into a non-biased estimator for (           ). The profiling          feature within the Fusion Service is adaptive.
is done as follows:                                                               The presented bias correction method offers yet another
      -     The surrounding land use (with evaluation radius of                advantage: episodes that affect air quality on a major scale, such
            200m) and population density (a wider evaluation                   as forest fires, are automatically accounted for if the input data
            radius of 6km) for both and is evaluated.                          contains some measurements from the episode-driven locations.
      -     The evaluated environment is expressed as a collection             For instance, if a background station has measured an
            of selected land-use frequencies and population density.           exceptionally high concentration of NO2, then the expected NO2
            This collection is referred to as a profile in this paper          concentration at a nearby urban environment is going to be
            (Fig 3).                                                           reflected on the episode-affected background concentration.
   After the evaluation of profiles, the difference between the
expected values is evaluated. Let (        ) be the estimator profile
and (         ) be the evaluated profile corresponding to the user             4. RESULTS
defined location and time. Then, a bias corrected estimator                       The performance of the presented environmental information
   (     ) is given by                                                         fusion method was evaluated using temperature forecasts
                                                                               provided by four well known weather service providers (FMI,
  (     )     (     )   (      (    )         (      )    (7)                  SMHI, Met Norway and Weather Underground). For 43 locations
where        (    ) is the expected hourly concentration of the                around Finland weather forecasts were extracted from respective
pollutant at the estimator’s location at time and      (     ) is              online sites and stored during several months in 2012. Uncertainty
                                                                               metrics in terms of             (     ) for individual SPs were
the expected pollutant concentration in the user defined location
at the time .                                                                  evaluated by comparing measured temperature values against
   The evaluation of Eq. 7 requires yet another statistical model              individual stored forecasts for each SP; a total of 2500 forecasted
(for each pollutant) to calculate the expected concentration as a              versus measured temperature -pairs for each SP were gathered in
                                                                               order to get statistically meaningful         (      ) estimates as
function of time and key land-use frequencies. Such a set of
statistical models has been implemented with the fusion service,               a function of forecasted period length. Then, fused forecasts
using the archived measurement time series in Finland as                       (temperature of the next 3 days) for the locations in August 2012
calibration data: the environments around the stations were                    were produced on a daily basis for each of these locations using
evaluated and multi-variable regression was applied. The                       the stored forecasts.
regression was repeated with several different land-use and


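The evaluation described above, in which stored forecasts are compared against measurements to obtain an uncertainty estimate per service provider and forecast length, might be sketched as follows. This is only an illustration under stated assumptions: the mean squared error is used as the variance estimate (which presumes unbiased forecasts), forecast length is binned by whole days, and the record layout, function names and provider labels are hypothetical rather than taken from the PESCaDO system.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# (provider, lead time in hours, forecast value, observed value)
Record = Tuple[str, float, float, float]
# (provider, lead time in whole days)
Key = Tuple[str, int]


def error_variance_by_lead(records: Iterable[Record]) -> Dict[Key, float]:
    """Mean squared forecast error per provider and forecast day."""
    squared = defaultdict(list)
    for provider, lead_hours, forecast, observed in records:
        lead_day = int(lead_hours // 24)  # assumed binning: whole days
        squared[(provider, lead_day)].append((forecast - observed) ** 2)
    return {key: mean(errs) for key, errs in squared.items()}


def fuse_forecasts(forecasts: Dict[str, float], lead_day: int,
                   variances: Dict[Key, float]) -> float:
    """Inverse-variance weighted forecast for one location and lead day."""
    precisions = {}
    for provider in forecasts:
        variance = variances[(provider, lead_day)]
        if variance <= 0:
            raise ValueError("variance estimate must be positive")
        precisions[provider] = 1.0 / variance
    total = sum(precisions.values())
    return sum(precisions[p] * forecasts[p] for p in forecasts) / total


if __name__ == "__main__":
    # Hypothetical stored forecast/measurement pairs for two providers.
    history = [
        ("A", 6.0, 21.0, 20.0),
        ("A", 12.0, 19.0, 20.0),
        ("B", 6.0, 22.0, 20.0),
        ("B", 30.0, 17.0, 20.0),
    ]
    variances = error_variance_by_lead(history)
    print(fuse_forecasts({"A": 21.0, "B": 22.0}, 0, variances))
```

Keeping the variance table keyed by provider and forecast day lets the weights shift automatically towards whichever provider has performed best for a given forecast length.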
   In Figure 4, the mean absolute error of temperature forecasts
and the fused forecast is presented. According to the figure, the fused
temperature forecasts have the lowest mean error with just four
different SPs providing forecasts simultaneously. This result
shows that the well-known benefits of forecast fusion can be
exploited within web services such as PESCaDO when the
performance of forecast providers is being monitored.

 Figure 4: Mean absolute error of temperature (°C) forecasts
 and the fused forecast for different forecast time spans.
 Forecasted and measured data for 43 different locations and
 time periods in August were used.

4.1 Performance of the environmental profiling feature
   The environmental profiling feature of the fusion service was
calibrated using measurement time series from Finland during
2010. To test the performance of this novel feature, 8 different
NO2 measurement stations with varying environments were
selected in 2011, and the observed hourly concentrations were
compared against the values predicted with the aid of the profiling
feature. The profiling feature differentiates working days and
weekends and for this test, the working days were selected.
   It can be seen from Figures 5a-h that the profiling feature is
able to predict the expected average NO2 concentration well in
various environments. Background areas, urban and
rural, fare better in the comparison while the traffic-intense
environments are more difficult to predict. This is to be expected
as the actual traffic volumes have to be derived using only the
local population and road intensity. As a consequence, the
profiling feature inevitably underestimates the expected
concentration near large motorways that have a small surrounding
population.

 Figure 5a-h: Predicted and observed hourly average
 concentration of NO2 during working days (Monday to
 Friday) in several measurement sites. Predicted values have
 been obtained by evaluating the station’s environment with
 the aid of the profiling feature.

4.2 Comparison of measured and predicted NO2 time series
   The performance of the fusion of air quality measurements
with the presented methodology was tested with NO2
measurements in Southern Finland. Measurement time series for
February 2011 from the available stations (n = 20) were used as
input data and fused NO2 concentrations were calculated for a
remote location for which a comparison time series was readily
available. The domain for the test can be seen in Figure 6,
which illustrates the fused concentration of NO2 at one of the
hours of interest.

 Figure 6: Fused NO2 concentration in Southern Finland in
 2011 at 07:00.

   The highest concentration can be found at the centre of
Helsinki, which resides in the bottom-right corner of the figure.
The remote test area is a small city centre (Lohja), located
approximately 70 kilometres to the right of Helsinki and 50
kilometres away from the nearest measurement station. The fused
values were compared against the on-site measurements in the
test area and the results are shown in Figure 7.
   The comparison between fused and measured NO2
concentration at the test site (Figure 7) shows that the pollutant
concentration has been estimated fairly accurately with the
presented method.
   During the study period the mean absolute error between
predicted and measured NO2 hourly concentration was of the
order of 7 µg/m3 (mean = 12 µg/m3, variance = 107 (µg/m3)^2). This
error is significantly smaller than the mean error that would result
from a conventional geographical extrapolation method:
using inverse distance weighting (IDW) [11], the resulting mean
absolute error would be 14 µg/m3.

  Figure 7: The observed and predicted NO2 concentration
 during February 2011 at the test site, the centre of Lohja city.

   Figure 8 illustrates a collection of mean absolute prediction
errors from calculations similar to the one presented in Fig. 7. One
by one, the measurement stations were removed from the input
data and the removed time series was compared against the fused
time series, which was produced using the remaining data.
According to Fig. 8, if the locations of the nearby measurements
represent an environment similar to that of the location for IDW
extrapolation (Laune station, Tikkurila station in Fig. 8), then the
IDW extrapolation may be able to predict the hourly
concentration fairly well. Otherwise, the IDW method, which has no
bias correction capabilities, generally produces poor estimates in
terms of mean absolute error, whereas the fusion service performs
well regardless of the collection of estimators used as input.
Luukki station, a rural NO2 background measurement
station, is an example of this: there are several urban measurement
stations nearby and thus the hourly concentration of NO2 in
Luukki cannot be extrapolated with conventional methods.

     Figure 8: Comparison of IDW extrapolation and the
   presented fusion method in terms of standard deviation.
     Observed average describes the average hourly NO2
             concentration at measurement site.

5. CONCLUSION
   To provide timely meteorological and air quality related
information to citizens and administrative users alike, a prototype
service, PESCaDO, was developed. By combining the data
discovery, extraction and fusion methods described in this paper,
it is possible to produce accurate and personalized information for
the users. Unlike with several search engines, the user is not confused
by the sheer amount of presented data and suggestions; instead,
the user is provided with a single, understandable yet precise
answer. This is also what separates PESCaDO from a
conventional, generic search engine. The self-maintaining design
of the PESCaDO system facilitates the discovery and indexing of
new information sources. The source provider’s performance can
be evaluated and stored on a continuous basis and the stored
performance data can be used to guide the fusion of information.
Furthermore, the measured air quality and meteorological data
that flows through the system can be used in the calibration of the
fusion service’s various statistical models, effectively allowing the
system to adapt to different regions.
   The fusion method offers several advantages for the PESCaDO
system. For instance, it is not necessary to discard any extracted
information, as the algorithm ensures that irrelevant input is
not over-emphasized. In this paper, a demonstration of the fusion
of temperature forecasts was given. It was shown that the fused
temperature forecast in fact had the lowest margin of error, which
shows the benefits to be had in the fusion of information
even if the number of service providers is small.
   It was shown that the presented profiling feature of the fusion
service is able to predict hourly concentrations of NO2 in different
environments quite well. As a consequence, the fusion method
was able to outperform a conventional extrapolation method
(IDW). However, NO2 is strongly affected by urbanization and
road traffic and thus is an ideal phenomenon to be handled with
the proposed fusion method. Other pollutants, such as
ozone and carbon monoxide, are more difficult to handle with the
presented profiling feature. In fact, the static environment-based
bias removal needs to be more dynamic in the future. This could
be achieved by introducing meteorology in the fusion process. For
instance, the profile could be analysed from the wind’s direction.
Furthermore, the expected concentration could be a function of
several meteorological parameters such as rain, sky conditions
and wind speed. As a result, the PESCaDO system would be
orchestrated on another new level, where the extracted
meteorological data would be subject to fusion and used again in
the fusion of air quality pollutants.

6. ACKNOWLEDGMENTS
   This work was supported by the European Commission under
the contract FP7-ICT-248594 (PESCaDO).

7. REFERENCES
[1] Balk, T., Kukkonen J., Karatzas, K., Bassoukos, A., and
    Epitropou, V., European Open Access Chemical Weather
    Forecasting Portal, Atmospheric Environment, 45(38),
    6917–6922, 2011.
[2] Bassoukos A., Karatzas K., Kelemis A. (2005)
    Environmental Information portals, services, and retrieval
    systems, Proceedings of “Informatics for Environmental
    Protection- Networking Environmental Information”-19th
    International EnviroInfo Conference, Brno, Czech Republic,
    pp. 151-155.
[3] Epitropou, V., Karatzas, K., Bassoukos, A., Kukkonen, J.
    and Balk, T., A new environmental image processing
    method for chemical weather forecasts in Europe,
    Proceedings of the 5th International Symposium on
    Information Technologies in Environmental Engineering.
    Poznan: Springer Series: Environmental Science and
    Engineering, 781–791, 2011.
[4] Hoek, G., Beelen, R., Hoogh, K., Vienneau, D., Gulliver, J.,
    Fischer, P. and Briggs, D. A review of land-use regression
    models to assess spatial variation of outdoor air pollution.
    Atmospheric Environment 42 (2008) 7561–7578,
    doi:10.1016/j.atmosenv.2008.05.057. 2008.
[5] Janssen, S., Gerwin, D., Fierens, F. and Mensink, C. Spatial
    interpolation of air pollution measurements using CORINE
    land cover data. Atmospheric Environment, Volume 42, Issue
    20, June 2008, Pages 4884–4903, 2008.
[6] Karatzas K. A fuzzy logic approach in Urban Air Quality
    Management and Information Systems (UAQMIS),
    Proceedings of the 4th International Conference on Urban
    Air Quality Measurement, Modelling and Management (R.
    Sokhi and J. Brechler eds), Charles University, Prague,
    Czech Republic, 25-27 March 2003, pp. 274-276, 2003.
[7] Karatzas K. Informing the public about atmospheric quality:
    air pollution and pollen, Allergo Journal 18, Issue 3/09, pp.
    212-217, 2009.
[8] Karatzas, K. and Kukkonen, J., COST Action ES0602:
    Quality of life information services towards a sustainable
    society for the atmospheric environment, ISBN: 978-960-
    6706-20-2, Thessaloniki: Sofia Publishers, 2009.
[9] Klein Th., Kukkonen J., Dahl Å., Bossioli E., Baklanov A.,
    Fahre Vik A., Agnew P., Karatzas, K., and Sofiev, M.,
    Interactions of physical, chemical and biological weather
    calling for an integrated assessment, forecasting and
    communication of air quality, AMBIO, 41(8), pp. 851-864,
    2012.
[10] Kukkonen, J., Olsson, T., Schultz, D.M., Baklanov, A.,
     Klein, T., Miranda, A. I., Monteiro, A., Hirtl, M., Tarvainen,
     V., Boy, M., Peuch, V.-H., Poupkou, A., Kioutsioukis, I.,
     Finardi, S., Sofiev, M., Sokhi, R., Lehtinen, K. E. J.,
     Karatzas, K., San José, R., Astitha, M., Kallos, G., Schaap,
     M., Reimer, E., Jakobs, H., and Eben, K., A review of
     operational, regional-scale, chemical weather forecasting
     models in Europe, Atmos. Chem. Phys (12), 1-87,
     doi:10.5194/acp-12-1-2012, 2012.
[11] Li, J. and Heap, A.D., 2008. A Review of Spatial
     Interpolation Methods for Environmental Scientists.
     Geoscience Australia, Record 2008/23, 137 pp, ISBN 978 1
     921498 30 5.
[12] Moumtzidou, A., Vrochidis, S., Tonelli, S., Kompatsiaris, I.,
     & Pianta, E. (2012). Discovery of Environmental Nodes in
     the Web, Proceedings of the 5th IRF Conference, Vienna,
     Austria, 2012.
[13] Moßgraber, J., Rospocher, M. Ontology Management in a
     Service-oriented Architecture. Architecture of a Knowledge
     Base Access Service. Proceedings of the 23rd International
     Workshop on Database and Expert Systems Applications.
     2012.
[14] Oyama, S., Kokubo, T., Ishida, T.: Domain-Specific Web
     Search with Keyword Spices Awareness in Urban Areas. J.
     IEEE Transactions on Knowledge and Data Engineering. 16
     (1), 17–24, 2004.
[15] Pianta, E., & Tonelli, S. KX: A Flexible System for
     Keyphrase Extraction. Proceedings of SemEval, 2010.
[16] Potempski, S. and Galmarini, S., Est modus in rebus:
     analytical properties of multi-model ensembles, Atmos.
     Chem. Phys., 9, 9471–9489, doi:10.5194/acp-9-9471-2009,
     2009.
[17] Sikora, T. The MPEG-7 visual standard for content
     description-an overview. IEEE Transactions on Circuits and
     Systems for Video Technology, 11(6), pp. 696-702, 2001.
[18] Tang, T. T., Hawking, D., Craswell, N., & Sankaranarayana,
     R. S. Focused crawling in depression portal search: A
     feasibility study. Proceedings of the 9th Australasian
     Document Computing Symposium, Melbourne, Australia,
     2004.
[19] Wald, L., Some terms of reference in data fusion, IEEE
     Transactions on Geosciences and Remote Sensing 37(3), pp.
     1190-1193, 2001.
[20] Wanner, L., Vrochidis, S., Tonelli, S., Mossgraber, J.,
     Bosch, H., Karppinen, A., Myllynen, M., Rospocher, M.,
     Bouayad-Agha, N., Bügel, U., Casamayor, G., Ertl, T.,
     Kompatsiaris, I., Koskentalo, T., Mille, S., Moumtzidou, A.,
     Pianta, E., Saggion, H., Serafini, L., and Tarvainen, V.
     Building an Environmental Information System for
     Personalized Content Delivery. In (Hrebícek J., Schimak G.,
     Denzer R. eds.): Environmental Software Systems.
     Frameworks of eEnvironment - 9th IFIP WG 5.11
     International Symposium, Proceedings. IFIP Publications
     359, Springer, ISBN 978-3-642-22284-9, pp. 169-176, 2011.
[21] Wanner L., Vrochidis S., Rospocher M., Moßgraber J.,
     Bosch H., Karppinen A., Myllynen M., Tonelli S.,
     Bouayad-Agha N., Casamayor G., Ertl Th., Hilbring D.,
     Johansson L., Karatzas K., Kompatsiaris I., Koskentalo T.,
     Mille S., Moumtzidou A., Pianta E., Serafini L. and
     Tarvainen V. Personalized Environmental Service
     Orchestration for Quality Life Improvement, 8th IFIP WG
     12.5 International Conference, AIAI 2012 Workshops, IFIP
     AICT 382 (L. Iliadis et al., eds), Proceedings, Springer,
     pp. 351-360, 2012.
[22] Weigel, A.P., Liniger, M.A. and Appenzeller, C. Can
     multi-model combination really enhance the prediction skill
     of probabilistic ensemble forecasts? Quarterly Journal of
     the Royal Meteorological Society, 134, 241–260, 2008.
[23] Wöber, K. Domain Specific Search Engines, In: Fesenmaier,
     D. R., Werthner, H., Wöber, K. (eds.) Travel Destination
     Recommendation Systems: Behavioral Foundations and
     Applications, 205–226. Cambridge, MA: CAB
     International, 2006.