<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>QBOAirbase: The European Air Quality Database as an RDF Cube</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Aalborg University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Airbase is the European air quality dataset maintained by the European Environment Agency. The dataset is available on the Web and contains air quality monitoring data for 40 European countries. The multidimensional nature of the data makes it a good fit for OLAP (Online Analytical Processing) systems. Moreover, by linking the data to the Semantic Web, we can magnify its value, allowing for more sophisticated data analytics. In this paper, we introduce and describe QBOAirbase, a multidimensional, provenance-augmented version of the Airbase dataset. QBOAirbase models air pollution data as an RDF cube, which has been linked to the YAGO and DBpedia knowledge bases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Airbase<sup>1</sup> is the European air quality dataset maintained by the EEA
(European Environment Agency). The dataset is available on the Web and contains
air quality monitoring data for 40 European countries. The numerical and
multidimensional data contained in the Airbase dataset can be efficiently handled
by Online Analytical Processing (OLAP) systems. Such systems are common in
data warehousing and business intelligence scenarios, and are optimized to handle
complex aggregation queries on multidimensional data with rare updates.
Multidimensional datasets, known as data cubes, consist of a set of observations, e.g.,
measurements of the concentration of an air pollutant. These observations are
described in terms of coordinates in a set of dimensions, e.g., time or location.
Observations are the target of OLAP applications.</p>
      <p>
        OLAP applications can benefit considerably from RDF and Linked Data [1–3,
8]. Thanks to the Linked Open Data initiative<sup>2</sup>, resources from different datasets
have been interlinked when they refer to the same real-world concept. Such a
network of datasets constitutes what we call the Semantic Web, and allows us to
see the Web as a giant knowledge base that can be queried, "understood", and
analyzed by software agents. The interest in OLAP on the Semantic Web has
been driven by the support for aggregation queries introduced in SPARQL 1.1
(the query language for RDF) and the publication of the QB vocabulary [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
to model multidimensional data in RDF. This has motivated the publication of
several datasets as RDF cubes<sup>3</sup>.
      </p>
      <p>
        Keeping track of the origin of the data is crucial in a setting with multiple
independent data providers. The provenance of a fact in an RDF data collection
is metadata about the source and the data transformations that led to the
publication of that fact. Such metadata finds application in scenarios such as data
fusion or access control [
        <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
        ]. While some RDF datasets have been augmented
with provenance information [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], these still constitute a minority.
      </p>
      <p>
        We present QBOAirbase, a multidimensional, linked, and
provenance-augmented version of the Airbase dataset in RDF. QBOAirbase represents
air pollution data as a three-dimensional multilevel cube modeled with
QB4OLAP [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], an extension of QB that allows for multilevel dimensions. By
linking QBOAirbase to YAGO [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and DBpedia [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we enrich the Airbase dataset
with further information about the cities and countries of the monitoring
stations, and the air pollutants. This opens the door to more sophisticated use cases
in the analysis of air pollution data. We elaborate on the design of QBOAirbase
in the following.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 QBOAirbase</title>
      <sec id="sec-2-1">
        <title>2.1 The Airbase Dataset</title>
        <p>
          QBOAirbase is built upon version 8 of the Airbase dataset<sup>1</sup>. The original dataset
is a collection of CSV and XML files containing annual concentration
measurements for 238 air pollutants (e.g., SO2, PM10) in 40 European countries from
1969 to 2012. Besides the actual measurements, the files also contain
information about the data providers and the monitoring stations. For the latter, this
includes the station's geographic location and technical details about the sensors'
configurations. The data is accessible via a SPARQL endpoint<sup>4</sup>. Unlike
QBOAirbase, this dataset is modeled with QB [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] instead of QB4OLAP [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and provides
neither provenance information nor links to existing knowledge bases.
Furthermore, the SPARQL endpoint does not offer detailed documentation about the
RDF schema and how to query the data.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Cube Structure</title>
        <p>Observations. In QBOAirbase, an observation maps to a measurement in the
original Airbase dataset, that is, the aggregation of a set of measurements for a
single air pollutant in an annual time span. QBOAirbase includes measurements
for a list of 15 pollutants out of the 238 present in the original dataset. This
is the minimal list of pollutants that a country must measure according to EU
regulations. The original dataset considers 20 aggregation functions, such as the
annual mean, the maximum, and the 50th percentile. All of them are
included in QBOAirbase.</p>
        <p>Dimensions. An observation is characterized by its coordinates in the year,
station, and sensor dimensions, as Figure 1 shows. The edges in the figure define
the schema properties, i.e., the RDF properties that connect the different levels
of the cube structure. The station dimension contains three levels: station, city,
and country. For some stations we did not have access to the information of the
city, hence those stations can only be rolled up to the country level. We have
manually linked the cities and countries in QBOAirbase to YAGO [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and
DBpedia [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The sensor dimension was artificially introduced and consists of two
levels: sensor and component. The sensor level represents a sensor configuration
and is described by a measurement unit, a type of equipment, a technique
principle of the equipment, an aggregation function, etc. A sensor can be rolled up to
a component, which corresponds to an air pollutant (e.g., NO2). The pollutants
associated with the components have been manually linked to their
corresponding YAGO and DBpedia resources, as we did for cities and countries. Both the
sensor dimension and the distinction between sensor and component allow us to
model the fact that a station can provide measurements for the same pollutant
under different sensor configurations, e.g., using a different aggregation function
or measurement unit. Sensors and pollutants are instances of the classes
Sensor and Property defined in the Semantic Sensor Network Ontology (SSN) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
SSN is a W3C candidate recommendation to describe sensors and measurement
procedures.
        </p>
        <p><sup>4</sup> http://semantic.eea.europa.eu/sparql</p>
        <p>[Figure 1. Cube structure. Nodes: Observation, Station, City, Country, Year, Sensor, Component; measure type: float. Schema properties (edge labels): air:schema/station, air:schema/inCity, air:schema/inCountry, air:schema/locatedIn, air:schema/year, air:schema/sensor, air:schema/component, air:schema/measure.]</p>
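        <p>The roll-up along the station dimension can be illustrated with a minimal Python sketch. It aggregates observations from the station level to the city or country level and skips stations without city information at the city level, as described above. The station identifiers, place names, and values are hypothetical, and the plain average is used as the aggregate only for illustration; none of this is taken from the QBOAirbase data or its implementation.</p>
        <preformat>
```python
from collections import defaultdict
from statistics import mean

# Hypothetical station metadata: station -> (city or None, country).
STATIONS = {
    "ST1": ("Aalborg", "Denmark"),
    "ST2": ("Aalborg", "Denmark"),
    "ST3": (None, "Denmark"),  # city unknown: can only roll up to country
}

# Hypothetical observations: (station, year, component, value).
OBSERVATIONS = [
    ("ST1", 2010, "NO2", 20.0),
    ("ST2", 2010, "NO2", 30.0),
    ("ST3", 2010, "NO2", 10.0),
]


def roll_up(observations, level):
    """Average values per (place, year, component) at the given level.

    level is "city" or "country". Stations without a city are skipped
    at the city level, mirroring the restriction described in the text.
    """
    groups = defaultdict(list)
    for station, year, component, value in observations:
        city, country = STATIONS[station]
        place = city if level == "city" else country
        if place is None:
            continue
        groups[(place, year, component)].append(value)
    return {key: mean(values) for key, values in groups.items()}


by_city = roll_up(OBSERVATIONS, "city")
by_country = roll_up(OBSERVATIONS, "country")
```
        </preformat>
        <p>In this sketch, station ST3 contributes to the country-level aggregate but not to any city-level aggregate, which is the behavior one would expect for stations that lack city information.</p>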
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Provenance</title>
        <p>
          Each RDF triple in QBOAirbase is augmented with its workflow provenance [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
In this model each triple is assigned an RDF resource, which we call its
provenance entity. Such an entity models the processes that led to the generation of
the triple, and is described with the W3C specification PROV-O [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. A
provenance entity can represent the source of a given measurement (e.g., a file), a
schema mapping, or the result of a data operation such as a data extraction or
a join. A data operation is modeled as an activity in PROV-O. Activities can
produce or depend on provenance entities, and they can be carried out directly or
indirectly by agents. These can be people, organizations, or even software agents.
        </p>
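        <p>As a rough illustration of this model, the following Python sketch walks from a triple's provenance entity through activities to the agents involved, and then selects the triples that can be traced back to a given agent. The dictionary names mirror PROV-O relations (prov:wasGeneratedBy, prov:used, prov:wasAttributedTo, prov:wasAssociatedWith), but all identifiers and values are hypothetical, and the actual provenance design of QBOAirbase may differ.</p>
        <preformat>
```python
# Hypothetical provenance graph; keys and values are illustrative identifiers.
WAS_GENERATED_BY = {      # entity -> activity that produced it
    "entity:obs1": "activity:extract1",
    "entity:obs2": "activity:extract2",
}
USED = {                  # activity -> entities it depends on
    "activity:extract1": ["entity:fileA"],
    "activity:extract2": ["entity:fileB"],
}
WAS_ATTRIBUTED_TO = {     # entity -> agent
    "entity:fileA": "agent:providerA",
    "entity:fileB": "agent:providerB",
}
WAS_ASSOCIATED_WITH = {   # activity -> agent
    "activity:extract1": "agent:etl",
    "activity:extract2": "agent:etl",
}

# Each triple is assigned one provenance entity.
TRIPLE_PROVENANCE = {
    ("obs:1", "air:schema/measure", "12.5"): "entity:obs1",
    ("obs:2", "air:schema/measure", "40.1"): "entity:obs2",
}


def agents_of(entity, seen=None):
    """Collect all agents reachable from a provenance entity."""
    seen = set() if seen is None else seen
    if entity in seen:
        return set()
    seen.add(entity)
    agents = set()
    if entity in WAS_ATTRIBUTED_TO:
        agents.add(WAS_ATTRIBUTED_TO[entity])
    activity = WAS_GENERATED_BY.get(entity)
    if activity is not None:
        if activity in WAS_ASSOCIATED_WITH:
            agents.add(WAS_ASSOCIATED_WITH[activity])
        # Follow the entities the activity depended on.
        for source in USED.get(activity, []):
            agents |= agents_of(source, seen)
    return agents


def triples_from(agent):
    """Return the triples whose provenance involves the given agent."""
    return [t for t, e in TRIPLE_PROVENANCE.items() if agent in agents_of(e)]
```
        </preformat>
        <p>For example, calling triples_from("agent:providerA") in this sketch returns only the triple derived from the file attributed to that hypothetical provider.</p>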
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Applications and Outlook</title>
      <p>The Airbase website provides an extensive list of reports in the form of
figures, tables, interactive maps, and visualizations. Most of the reports can be
generated by OLAP operations on the available data. However, some reports
require additional data not present in the dataset. One example is the percentage
of urban population resident in areas where pollutant concentrations are higher
than the recommended limit values<sup>5</sup>, which requires the population of the cities and
the recommended concentration values of the pollutants. Such information can
be obtained from another knowledge base. This justifies our decision to link
QBOAirbase to other sources in the Semantic Web. Moreover, the workflow
provenance gives users more control over the data, e.g., by restricting reports to
data coming from particular institutions.</p>
      <p><sup>5</sup> https://www.eea.europa.eu/data-and-maps/figures/urban-population-resident-in-areas-pollutant-limit-target</p>
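      <p>The urban population report mentioned above can be sketched as a simple computation that combines cube data with externally obtained data. In the following Python sketch, the city names, concentrations, populations, and limit value are all hypothetical; in practice the populations and limit values would have to come from a linked knowledge base or another source.</p>
      <preformat>
```python
# Hypothetical city-level annual mean concentrations (as could be derived
# from the cube by rolling up to the city level).
CITY_CONCENTRATION = {"CityA": 45.0, "CityB": 30.0, "CityC": 55.0}

# Hypothetical populations (as could be obtained from an external
# knowledge base linked to the cities).
CITY_POPULATION = {"CityA": 100_000, "CityB": 250_000, "CityC": 150_000}

LIMIT_VALUE = 40.0  # hypothetical recommended limit, same unit as above


def percentage_above_limit(concentration, population, limit):
    """Percentage of the total population living in cities above the limit."""
    total = sum(population[city] for city in concentration)
    exposed = sum(
        population[city]
        for city, value in concentration.items()
        if value > limit
    )
    return 100.0 * exposed / total


share = percentage_above_limit(CITY_CONCENTRATION, CITY_POPULATION, LIMIT_VALUE)
```
      </preformat>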
      <p>QBOAirbase is publicly available at http://qweb.cs.aau.dk/qboairbase.
We also provide a SPARQL endpoint and detailed documentation of the cube
and provenance design. Like Airbase, QBOAirbase is released under the terms of
the ODC Open Database License (ODbL).</p>
      <sec id="sec-3-1">
        <title>Acknowledgments</title>
        <p>This research was partially funded by the Danish Council for Independent
Research (DFF) under grant agreement No. DFF-4093-00301.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><given-names>A.</given-names> <surname>Abelló</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Romero</surname></string-name>,
          <string-name><given-names>T. B.</given-names> <surname>Pedersen</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Berlanga</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Nebot</surname></string-name>,
          <string-name><given-names>M. J.</given-names> <surname>Aramburu</surname></string-name>,
          and
          <string-name><given-names>A.</given-names> <surname>Simitsis</surname></string-name>.
          <article-title>Using Semantic Web Technologies for Exploratory OLAP: A Survey</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>27</volume>
          (
          <issue>2</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
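      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><given-names>Kim</given-names> <surname>Ahlstrøm</surname></string-name>,
          <string-name><given-names>Katja</given-names> <surname>Hose</surname></string-name>,
          and
          <string-name><given-names>Torben Bach</given-names> <surname>Pedersen</surname></string-name>.
          <article-title>Towards Answering Provenance-Enabled SPARQL Queries Over RDF Data Cubes</article-title>
          .
          <source>In JIST</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>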
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><given-names>Alex B.</given-names> <surname>Andersen</surname></string-name>,
          <string-name><given-names>Nurefsan</given-names> <surname>Gür</surname></string-name>,
          <string-name><given-names>Katja</given-names> <surname>Hose</surname></string-name>,
          <string-name><given-names>Kim A.</given-names> <surname>Jakobsen</surname></string-name>,
          and
          <string-name><given-names>Torben Bach</given-names> <surname>Pedersen</surname></string-name>.
          <article-title>Publishing Danish Agricultural Government Data as Semantic Web Data</article-title>
          .
          <source>In JIST</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><given-names>Tyrone</given-names> <surname>Cadenhead</surname></string-name>,
          <string-name><given-names>Vaibhav</given-names> <surname>Khadilkar</surname></string-name>,
          <string-name><given-names>Murat</given-names> <surname>Kantarcioglu</surname></string-name>,
          and
          <string-name><given-names>Bhavani</given-names> <surname>Thuraisingham</surname></string-name>.
          <article-title>A Language for Provenance Access Control</article-title>
          .
          <source>In CODASPY</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><given-names>Lorena</given-names> <surname>Etcheverry</surname></string-name>
          and
          <string-name><given-names>Alejandro A.</given-names> <surname>Vaisman</surname></string-name>.
          <article-title>QB4OLAP: A Vocabulary for OLAP Cubes on the Semantic Web</article-title>
          .
          <source>In COLD</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Olaf</given-names>
            <surname>Hartig</surname>
          </string-name>
          .
          <article-title>Provenance Information in the Web of Data</article-title>
          .
          <source>In LDOW</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
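      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><given-names>Jens</given-names> <surname>Lehmann</surname></string-name>,
          <string-name><given-names>Robert</given-names> <surname>Isele</surname></string-name>,
          <string-name><given-names>Max</given-names> <surname>Jakob</surname></string-name>,
          <string-name><given-names>Anja</given-names> <surname>Jentzsch</surname></string-name>,
          <string-name><given-names>Dimitris</given-names> <surname>Kontokostas</surname></string-name>,
          <string-name><given-names>Pablo N.</given-names> <surname>Mendes</surname></string-name>,
          <string-name><given-names>Sebastian</given-names> <surname>Hellmann</surname></string-name>,
          <string-name><given-names>Mohamed</given-names> <surname>Morsey</surname></string-name>,
          <string-name><given-names>Patrick</given-names> <surname>van Kleef</surname></string-name>,
          <string-name><given-names>Sören</given-names> <surname>Auer</surname></string-name>,
          and
          <string-name><given-names>Christian</given-names> <surname>Bizer</surname></string-name>.
          <article-title>DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia</article-title>
          .
          <source>Semantic Web Journal</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>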
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><given-names>Adriana</given-names> <surname>Matei</surname></string-name>,
          <string-name><given-names>Kuo-Ming</given-names> <surname>Chao</surname></string-name>,
          and
          <string-name><given-names>Nick</given-names> <surname>Godwin</surname></string-name>.
          <article-title>OLAP for Multidimensional Semantic Web Databases</article-title>
          .
          <source>In BIRTE</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><given-names>Pablo N.</given-names> <surname>Mendes</surname></string-name>,
          <string-name><given-names>Hannes</given-names> <surname>Mühleisen</surname></string-name>,
          and
          <string-name><given-names>Christian</given-names> <surname>Bizer</surname></string-name>.
          <article-title>Sieve: Linked Data Quality Assessment and Fusion</article-title>
          .
          <source>In EDBT/ICDT Workshops</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><given-names>Fabian M.</given-names> <surname>Suchanek</surname></string-name>,
          <string-name><given-names>Gjergji</given-names> <surname>Kasneci</surname></string-name>,
          and
          <string-name><given-names>Gerhard</given-names> <surname>Weikum</surname></string-name>.
          <article-title>YAGO: A Core of Semantic Knowledge</article-title>
          .
          <source>In WWW</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><given-names>Yannis</given-names> <surname>Theoharis</surname></string-name>,
          <string-name><given-names>Irini</given-names> <surname>Fundulaki</surname></string-name>,
          <string-name><given-names>Grigoris</given-names> <surname>Karvounarakis</surname></string-name>,
          and
          <string-name><given-names>Vassilis</given-names> <surname>Christophides</surname></string-name>.
          <article-title>On Provenance of Queries on Semantic Web Data</article-title>
          .
          <source>IEEE Internet Computing</source>
          ,
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. World Wide Web Consortium.
          <article-title>The RDF Data Cube Vocabulary</article-title>
          . https://www.w3.org/TR/vocab-data-cube/,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. World Wide Web Consortium.
          <article-title>Semantic Sensor Network Ontology, W3C Candidate Recommendation</article-title>
          . https://www.w3.org/TR/vocab-ssn/,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. World Wide Web Consortium.
          <article-title>PROV-O: The PROV Ontology</article-title>
          . http://www.w3.org/TR/2013/REC-prov-o-20130430/,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>