<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dealing with Diversity in a Smart-City Datahub</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mathieu d'Aquin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Adamou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Daga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuangyan Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Keerthi Thomas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Motta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The Open University</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present the data curation approach taken by the MK:Smart project, creating a large data repository of datasets about all aspects of the city of Milton Keynes in the UK and its citizens. The issues faced here, which we believe will become more and more common to large, data-centric smart-cities initiatives, is the one associated with the diversity of these thousands of datasets in terms of the licenses, policies and terms they are associated with them. We describe this repository of datasets, the MK Datahub, and its architecture to create data work ows from original sources to applications. We focus on the approach taken to record, in a structured, ontology-based way the components of the licenses and policies of each dataset, as well as the tools we are developing to manage such representations and to reason with them.</p>
      </abstract>
      <kwd-group>
        <kwd>Data curation</kwd>
        <kwd>licenses</kwd>
        <kwd>policies</kwd>
        <kwd>ODRL</kwd>
        <kwd>smart-city</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Data and the intelligent processing of data is at the center of smart-city
initiatives, whether they rely on the simple publication of open data by local
authorities (see for example [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) or on more complex systems including big data and big
computation (see for example [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]). Once a certain scale is reached, such projects
have to face not only the amount of data to store, process and deliver, but also
their variety. Indeed, they need to deal with data regarding many di erent
aspects of the functioning of the city, its inhabitants, services, infrastructures, etc.
Semantic technologies obviously have a role to play in dealing with this variety,
supporting the integration of data in a meaningful way accross sources.
      </p>
      <p>However, beyond dealing with semantic interoperability, there is a need to
manage the number and diversity of these datasets. Indeed, in projects such as
MK:Smart1, which plan to deal with several thousands of datasets, the curation
process of these datasets needs to be streamlined, to avoid for the system and
its operators to become overwhelmed with having to deal with the mechanisms</p>
    </sec>
    <sec id="sec-2">
      <title>1 http://mksmart.org</title>
      <p>associated with the import, management and redistribution of data from diverse
sources, in diverse formats, having diverse requierements for storage and update,
and most importantly, being associated with diverse policies, licenses and terms
of use, which need to be instantiated and managed across the whole work ow of
the information services associated with the data repository.</p>
      <p>In this paper, we describe the initial approach taken by the MK:Smart project
to deal with this diversity, with the intent to create a datahub that scales not
only in the amount of data it can hold, but in the number of di erent situations
which dimensions a ect the e ort associated with curating the datasets and
managing the associated work ows. We rst give an overview of the MK:Smart
project and discuss the building of a large scale datahub to store and re-deliver
data from a large variety of sources. We then present the approach currently
being deployed in this project, as part of this datahub, to deal with the diverse
set of policies, terms and licenses attached to these datasets.
2</p>
      <sec id="sec-2-1">
        <title>The MK:Smart Project and the MK Datahub</title>
        <p>In this section, we brie y introduce the MK:Smart project in Milton Keynes
(UK), the role of the large data repository which the project is creating (the
MK Datahub) and how this repository will support the import, integration and
delivery a large amounts of data comning from a large number of sources.
2.1</p>
        <sec id="sec-2-1-1">
          <title>The MK:Smart Project</title>
          <p>Milton Keynes (MK) is a new town in Buckimghamshire, which has been o
cially designated in 1967. It is now one of the fastest growing cities in the UK,
with a population of approximetly 230,000 inhabitants. However, the challenge of
supporting sustainable growth without exceeding the capacity of the
infrastructure, and whilst meeting key carbon reduction targets, is a major one. MK:Smart
is a large collaborative initiative, partly funded by HEFCE (the Higher
Education Funding Council for England) and led by The Open University, which aim
to develop innovative solutions to support economic growth in MK.</p>
          <p>Central to the project is the creation of a state-of-the-art \MK Datahub"
which will support the acquisition and management of vast amounts of data
relevant to city systems from a variety of data sources. These will include data
about energy and water consumption, transport data, data acquired through
satellite technology, social and economic datasets, and crowdsourced data from
social media or specialised apps.</p>
          <p>On this basis, the project is looking to support the creation of new sensor and
analytics applications and services, putting a particular emphasis on the areas
of transport, energy and water. Indeed, while MK was planned out to support
tra c relativly e ciently compared to other towns, mobility is critical to
economic activities and intelligent, hybrid transportation systems that intelligently
facilitate mobility within the area are needed. Also, with the constant growth of
the area in terms of population, there is a risk that the current infrastructures
for the delivery of water and energy will reach their limit capacity in the future.
Smart technologies here are thought to be used to improve energy and water e
ciency, reducing consumption (as well as making generation more e cient) both
through data analytics applied at the level of the infrastructure, and through
supporting citizens in changing their behaviour.
2.2</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>The MK Datahub and Data Work ows</title>
          <p>The MK Datahub is the central IT and data management infrastructure of the
MK:Smart project. It is based on the simple idea that innovative solutions in
the three domains mentioned above, and in many others, will require data that
can be drawn from a large varietry of sources. The idea is therefore to build
a common facility to e ciently manage, integrate and re-deliver such data for
applications and services to rely on, reducing development costs for all of these
applications, and enabling intelligent data processing mechanisms (mining,
analytics, aggregation, alignment, linking) at the scale of the entire city, in a common
data infrastructure.</p>
          <p>The MK Datahub is made out of several layers. It relies on a hardware
infrastructure physically located in MK, but which is then con gured similarly to
a private cloud (using common virtualisation technologies, and cloud
orchestration services) to enable optimal use by the layers above, as well as to meet the
(changing) requirements of applications. The data management layer is the main
focus of this paper, and is being described in more details below. Its role is to
implement the work ow that connect data in their original sources to
applications that might want to exploit these data. Finally, additional development and
service management mechanisms sit on top of the data management layer, for
the management of the users/customers of the datahub, and of the developers
of applications.</p>
          <p>An overview of the architecture of the data management layer of the MK
Datahub is shown in Figure 1. This architecture is itself based on three main
layers: The data import layer, the storage layer and the data delivery layer.
Starting from the middle, one of the choices made in the design of this
architecture is not to rely on one single type of storage technology, but to implement
data storage in a distributed and hybrid way (RDMSs as well as graph-based
and triple stores). The reason for this choice is rst that maintenance and
management of smaller storage components distributed amongst dedicated servers is
easier and more robust than it would be with a unique warehouse when talking
about thousands of datasets, the sources of most of them being out of our
control. Also, the development cost of import pipelines are reduced when we can
choose the most appropriate storage format amongst a number of options for
each data feed to be considered.</p>
          <p>The main goal of the bottom layer is to create the pipelines to import each
of the thousands of data sources to be included in and re-delivered through the
datahub. Here already, the challenge is in the diversity of these data sources.
Indeed, each might rely on a di erent format, a di erent mode of transfer, have
di erent constraints attached to it, a di erent update rate, etc. The challenge
here is therefore to create a data processing pipeline framework which allows
at the same time the creation of ad-hoc processing pipelines for each of the
considered data sources, but also the reuse of processing components and their
automatic invocation to reduce their development and maintenance costs. The
approach taken here is to create a generic wrapper for data processing services
that can be distributed amongst the various (internal) processing servers
available to the datahub. Each processing service is developed independetly, but uses
standard communication mechanisms (here, we rely on Apache ActiveMQ2) as
well as conventions for their naming and the naming of communication
channels (queues and topics in ActiveMQ) to facilitate the creation of simple data
import pipeline as sequences of components that can be executed automatically
by a scheduler. The scheduler, takes information from the data catalogue of the
datahub (see Section 3.2 about the data catalogue, especially focusing on
policies and terms) about the original data feed, the pipeline, the storage destination
and the update rate of the data. If decomposed in a su ciently granular way,
generic components developed for one pipeline (e.g. transformation of statistical
data out of a business intelligence tools into RDF Data Cube) can therefore be
reused in the creation of other pipelines, and pipelines are naturally distributed
in the services that implement the components.</p>
          <p>Of course, the \lazy" approach taken in the import and storage layer have
an impact on the top layer regarding re-delivery. Indeed, the goal of this layer
is to provide data to external applications and services in a homogeneous and
convenient way, through both feed-centric and entity-centric (URI
dereferenc2 http://activemq.apache.org/
ing) Web APIs. This can indeed be called a \lazy" approach to data integration
as we here only integrate data at the last step of re-delivery, leaving data feeds
isolated in independent databases in the datahub. Therefore, data integration
in the MK Datahub can be likened to dynamic compilation, where information
about a particular entity (using a global URI) is drawn at runtime from various
databases, put together in a common canonical format using canonical, global
identi ers (URIs) for connected objects, and then transfered to the client in the
speci c format of their choice (RDF, JSON, XML, CSV, etc). The (ongoing)
implementation of this has therefore to face many technical challenges, including
the manipulation of an e cient intermediary format to manipulate results of
distributed queries from many, diverse databases, the caching of these intermediary
results and their update, as well as the tracing of the sources of data included
in each integrated result, in particular to enable the propagation of terms and
policies as discussed in the next section.
3</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Dealing with Diverse Data Sources, with Diverse Policies</title>
        <p>Here, we present the data curation approach thought for the MK:Smart project,
focusing in particular on the management of the various terms and conditions
of use and re-distribution of data that might be associated with each dataset.
From a technology point of view, we take an approach that follows the
principles of structured data and semantic technologies, ensuring the cataloguing and
recording of these terms and conditions in an easily exploitable format. Based
on these cataloguing practices, we set up a process for the reviewing and
updating of the curated datasets that directly feed into the structured representation
of datasets. We also describe the ontology-based approach thought to manage
the propagation of the policy rules that represent these terms and conditions as
datasets are being processed.
3.1</p>
        <sec id="sec-2-2-1">
          <title>Data Cataloguing with Structured Data</title>
          <p>As described above, one of the key challenges for the management and
maintenance of the MK Datahub is the large number of datasets to be included, that
are attached to a very diverse set of policies, that might relate to open data and
open licenses, as much as to commercial Web APIs, corporate data and personal
data. For this reason, establishing a small set of common policies that datasets
would have to adhere to is not feasible, but the complexity of manually handling
all of the di erent policies on a case by case basis would also amount to an
unfeasible task. The approach taken here is therefore to automate, as much as
possible, the recording, tracking and updating of each of the policies attached
to each of the datasets, through relying on a data cataloguing system that
includes an ontology-based, structured representation of the components of these
policies. The idea here, once again, is that many policies from many datasets
will share components which, while not necessaraly expressed in the exact same
way, represent the same rule or constraint (e.g. attribution, no modi cation,
retention for a certain period of time). Therefore, policies that might be unique
to particular datasets might be described using components (rules/constraints)
that are shared with many others.</p>
          <p>
            There has been an increasing interest lately in the structured and semantic
representation of rights, policies and terms for various kinds of objects, from the
ones that focus on speci c types of policies (e.g. privacy policies as in P3P [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ])
to some focusing on speci c types of terms (e.g. open licenses as in CC REL3).
ODRL (the Open Digital Rights Language) [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] and the ODRL ontology [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] have
been recently proposed to provide \ exible and interoperable mechanisms to
support transparent and innovative use of digital content in publishing, distribution,
and consumption of of digital media across all sectors and communities. The
ODRL Policy model is broad enough to support traditional rights expressions for
commercial transaction, open access expressions for publicly distributed content,
and privacy expressions for social media."4. ODRL has been used in several
initiatives recently to represent policies attached to datasets, and the ODRL
ontology provides a simple, yet exible vocabulary to represent and manipulate
both high level objects such as policies, as well as more granular components
including rules to represent permissions, prohibitions and duties with respect
to the use, re-use and re-distribution of digital assets. It is therefore an ideal
base to be used to record, keep track and manipulate information regarding the
datasets included in the MK Datahub. As an example, the code below is a
simpli ed representation of the Open Government License (OGL5) used by many
open data providers following the ODRL ontology:
&lt;http://mksmart.org/dc/policy/ogl&gt;
a odrl:Set;
odrl:permission odrl:copy ;
odrl:permission odrl:publish ;
odrl:permission odrl:reproduce ;
odrl:permission odrl:commercialize ;
odrl:duty odrl:attribute.
          </p>
          <p>The requirements for the data cataloguing features of the MK datahub are
therefore that they should support not only the description of the datasets, of
their sources and of their import pipelines, but should also make it possible
to describe the policies attached to the datasets at a granular levels (including
their components) in a way compatible with the ODRL model. While several
platforms exist for data cataloguing (including, most prominently, CKAN6 and
Socrata7), these only support the general description and annotation of datasets,
3 http://wiki.creativecommons.org/CC_REL
4 http://www.w3.org/ns/odrl/2/</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/</title>
      <p>2/</p>
    </sec>
    <sec id="sec-4">
      <title>6 http://ckan.org/</title>
    </sec>
    <sec id="sec-5">
      <title>7 http://www.socrata.com/</title>
      <p>without going into the details of the processes to manipulate them and, crucially,
of the attached policies (beyond a link to the corresponding license).</p>
      <p>We therefore developed a data cataloguing system that reproduce similar
features to CKAN, but also includes features for data cataloguers to enter the
details of other dimensions that are not fully supported in CKAN. This system
is being developed as a plugin of the Wordpress content management system
(CMS)8. Wordpress is the most used CMS on the web, and is open source.
Relying on Wordpress meant that we could reuse the basic data model and content
organisation it implements, and reuse the interaction patterns Wordpress users
are familiar with and that have proven e ective through the popularity of the
CMS with both professional web developers and general members of the public.</p>
      <p>The implementation of the MKS Data Cataloguing plugin for wordpress is
mainly centered arround two \custom post types": Datasets and Policies.
Custom post types in wordpress are ways to create posts that are dedicated to the
representation of particular types of objects. In this way, for example, a dataset is
described in a speci c type of posts which can directly reuse the standard
Wordpress post interface for editing its description (see Figure 2). The information
associated with a dataset (and so editable through the Wordpress interface) is
then customised in two ways: By introducing speci c taxonomies and by adding
custom attributes. Taxonomies are used in Wordpress to represent information
attached, as annotations or metadata to the posts. The standard taxonomies
provided by Wordpress for common posts are tags and categories (which we also
use for datasets). However, new taxonomies can also be introduced that
repre</p>
    </sec>
    <sec id="sec-6">
      <title>8 http://wordpress.org</title>
      <p>sent di erent aspects to link the dataset to. Here we use taxonomies to represent
the owner of a dataset (i.e. its source or origin), its orginal format and a \status"
tag to represent the level of inclusion the dataset should have in the datahub.
Interestingly, only using this simple mechanism allows us to reproduce most of
the features of system such as CKAN, since relying on taxonomies makes
standard Wordpress search and browsing functions directly useable on our custom
dataset post type.</p>
      <p>Custom attributes are used for information that requires more expressivity
than a simple taxonomy. For dataset, they are used to represent the data
import pipeline associated with each dataset (Figure 3), as well the link the the
policy/license associated with the dataset (Figure 4).</p>
      <p>Similarly to datasets, policies and licenses are represented using a custom
post type. The Policy post type also includes the basic elds for describing each
policy (e.g. to include the text of the license being described). Four taxonomies
are then used to represent each policy in a way that is compatible with the
ODRL model (see Figure 5 for the interface to edit/create a policy). The \Scope"
taxonomy is used to represent the applicability of the policy. This is especially
useful in cases where the dataset/data source is associated with more than one
policy, for example with one policy for personal use and another for commercial
use. The three other taxonomies (\Permissions", \Prohibitions" and \Duties")
are used to represent the three types of rules that form components of the policies
in the ODRL model. They are pre-populated with \Actions" from the ODRL
ontology (e.g. \Attribute", \Modify", \Copy", etc). Here again, the advantages
of implementing it this way are that it makes it easy for a user to describe a
policy or license reusing elements from these taxonomies that have been used to
describe other policies and licenses, that it is relatively easy for users to describe
the complex objects of policies, and that the long-term use of this system will
facilitate the creation of a large, queriable and searchable \knowledge base" of
policies and rules, and of their connection with thousands of relevant datasets.</p>
      <p>
        The data associated with datasets and policies in the MKS Data Cataloging
plugin are naturally stored in the intermal database of the Wordpress CMS.
However, in order to make these data directly usable for the scheduler and data
integration components of the datahub (see Figure 1), they are also
automatically transformed in RDF and sent to a dedicated triple store using the ODRL
ontology for policies, and a mix of other ontologies (VoID [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], DC [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], Prov-O [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
etc.) for datasets, everytime a user saves an entry of one of the two custom types.
3.2
      </p>
      <sec id="sec-6-1">
        <title>Process to Manage and Update the Data Catalogue</title>
        <p>The previous section describes the technical approach to managing diverse datasets,
associated with diverse policies and licenses. On the basis of the tool created, a
process can therefore be setup to take care of the update and review of the
catalogue when new datasets are being included, and when the policies of existing
datasets have changed, as summarised in Figure 6.</p>
        <p>At the center of this process is the \Data Governance Board", which is in
charge of reviewing the policies and licenses attached to each dataset put
forward as candidate for inclusion. The board is formed of members of the project
management team and of at least one \data cataloguer" who has the speci c
role to input the results of deliberations of the board into the data catalogue,
through the tool presented in the previous section. For every new dataset, three
di erent situations might appear:
1. The dataset is associated with a known license (green arrows): This is the
simplest case, as the license would have already been reviewed and created
in the data catalogue. In this case, the only task is to select the license from
the list when creating the dataset description.
2. The dataset is associated with a license which is not already known (blue
arrows): The new license therefore needs to be reviewed and described within
the data catalogue to be associated with the newly created description of the
corresponding dataset. Describing the license corresponds to rst including
a descriptive text in the catalogue (which might be the whole text of the
license) and second, choosing the set of rules that apply in each of the three
taxonomies of permissions, prohibitions and duties. ODRL already includes
some of the most common actions that can be associated with rules, and
these are pre-loaded into the catalogue (e.g. permission to copy, prohibition
to modify, duty to attribute, etc). Also, the data cataloguer is encouraged,
as much as possible, to reuse existing rules from other licenses at the basis of
the description, so that the work of creating new components is reduced, the
taxonomies remain manageable and the license descriptions are consistent. It
is expected however that some licenses will include rules that have not been
encountered before, which will therefore need to be created (and described)
by the data cataloguer at the time of describing the license (using the simple
mechanisms for editing provided by Wordpress).
3. The dataset is not yet associated with a license (red arrows): This is the
most complex case, as the policy for inclusion in the Datahub will in this
case need to be negotiated directly with the data owner. An interesting
aspect of this however is that this negotiation can be done on the basis
of existing licenses in the datahub, and reuse existing rules (permissions,
prohibitions and duties) as a basis for creating a new policy speci c to the
negotiated Dataset, therefore increasing reuse and reducing the complexity
of the negotiation step. It will also facilitate the review of the established
license by the Legal Department of The Open University in the process of
approving it.</p>
        <p>
          It is important to notice here that, beyond the licenses and policies directly
attached to the Dataset by/with the data owner, there are other aspects of the
datasets that might require the creation of speci c policies. These include:
{ When the data have human or animal sujects included: An Ethics Committee
will have to review not only the license, but also the data itself, to check that
the data is collected, processed and re-distributed in a way that is compatible
with the ethical standards in place.
{ When the data includes personal data: The Data Protection O cer of the
organisation will also have to review the license and dataset. In this case,
they will not only be in charge of approving the inclusion of the dataset with
the established license, but also to support the Data Governance Board in
creating a privacy policy (in accordance with the Data Protection Act [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ])
for the dataset. This privacy policy will be represented in the same way as
the license and other policies associated with each dataset.
        </p>
        <p>While this process appears very complex (and it is), it also demonstrates the
value of the structured representation with clear semantics (based on ODRL)
and of the availability of a simple tool to create, reuse and manipulate these
structured representations. Indeed, here, it is expected that the set of licenses
and taxonomies will grow progressivly as datasets are being included, creating
a knowledge base of policies and rules becoming more and more useful in the
description of new licenses, and therefore supporting scaling the data catalogue
to thousands of datasets by reducing the e ort required to review their policies.
3.3</p>
      </sec>
      <sec id="sec-6-2">
        <title>Propagating Dataset Characteristics Accross Data ows</title>
        <p>The previous sections discuss the way to manage policies and rules that apply to
the distribution and use of data as static objects. However, as described before,
a key role of the MK Datahub is also to provide the basic services required to
process such data, including integrating di erent sources with each other and
analysing their content to create new information from several, raw datasets.
Every processing step, manipulation and combination of data naturally has an
e ect on the policy rules in place: Combining datasets might require to
combine rules in a non-trivial way; Anonymising datasets might reduce some of the
permissions and prohibitions associated with the privacy policy and data
protection; Modifying the data might nullify some of the permissions and contradict
some of the prohibitions. Having a structured representation of the policy rules
that apply to a dataset (in ODRL as described above) naturally helps with this
issue. However, it also requires a semantic representation of the data ows, i.e. a
representation of what has happened, fundamentally, to data in such a data ow,
in order to understand how it might a ect the permissions, prohibitions and
duties that are attached to it.</p>
        <p>
          There are several approaches to represent work ows on data. One of the
most related to our issue here is Prov-O [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the provenance ontology, which
includes the representation of processes that might a ect data beyond their original
sources. However, while the process applied might, in itself, a ect and/or
propagate the policy rules attached to the dataset being processed, these processes
only represent the mechanisms that implement a functional relationship between
the dataset(s) in input and the dataset in output. While processes might be
organised and structured according to a variety of dimensions, it is really these
relationships that dictate the way the policy rules are a ected and propagated
from one (raw) dataset to another (processed) one. A meaningful representation
of these relationships and of their taxonomic organisation is therefore what is
needed to establish policy rule propagation mechanisms.
        </p>
        <p>
          For this reason, and other similar use cases, we created the Datanode
ontology [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]9, which can be described, in a nutshell, as an ontology of relationships
between data artefacts. In this ontology, Datanode is a class of objects which
is meant as an abstract class for di erent types and forms of data artefacts
(datasets, databases, streams, feeds, etc). The rest of the ontology consists of the
formal de nition of a taxonomy of relationships between datanodes (as
properties), including various aspects such as the description of datanodes (metadata),
datanodes being derivations from others (e.g. from mining), datanodes
overlaping in terms of their interpretations, capabilities, etc.
        </p>
        <p>Through the Datanode Ontology, it is possible to represent, in a very simple
and basic form, data ows that focus on the fundamental relationships that exist
between the origin datasets, intermediary ones and the nal result. It is also
possible to express rules for propagating or a ecting policies and policy rules in
these data ows. For example, it is possible to express that a datanode that is
an anonymised version of another one does not need to include in the attached
policy the duty rules that are related to data protection. It can also express
that duties such as attribution should be propagated to any dataset that are
derived from any that included this policy rule originally, but not necessaraly
through other relations. A prohibition to modify the dataset means that relations
such as being a copy, a version, a description or overlaping with the considered
one should not a ect the policy, but any relation encapsulating a modi cation
(derivation) would nullify any permission rule attached to the same policy. In</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>9 http://www.enridaga.net/datanode/0.3/docs/</title>
      <p>other words, relying on a semantic, structured representation of both the policies
attached to datasets and of the data ows in which they might be included allows
us to reason upon the way policy rules propagate wihtin these data ows.
4</p>
      <sec id="sec-7-1">
        <title>Related Work</title>
        <p>
          MK:Smart is certainly neither the rst, nor the last smart city to be developed
in Europe and the world. Many of these initiatives have very similar focuses,
including the collection of large amounts of data from sources including sensors,
local governements and companies. While the notion of a smart city goes beyond
technology and include aspects of policy and governance which are certainly
connected to the work presented here (see e.g. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]), to the best of our knowledge,
very little attention has been given so far to structuted approaches for the
management of the very diverse licenses and policies that can be attached to datasets
in such enviroments. Dublin City is an interesting example, bene ting from the
work of IBM to develop the necessary technical infrastructures and tools (see
e.g. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]), and from a clear ambition from the local authorities to release data
(see Dublinked10), especially as open data. Some of these technical
architectures do actually mention aspects strongly related to the ones we focus on here,
such as provenance (to track the sources of data, and the processes associated
with them), privacy and general governance policies (see for example [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]).
However, little details are actually available about the way in which these aspects
are handled, and whether they bene t, like in our approach, from a structured,
ontology-based representation of data policies and of their components. The
result is that, while Dublin might be one of the most \data-centric" of smart city
iniatitives, licenses and restrictions are still represented in the Dublinked portal,
as in most other data catalogs out there, as basic, textual metadata elds.
        </p>
        <p>
          While the issues described above, related to the management of highly diverse
data sources in smart cities from the point of view of licenses and policies have
surprisingly received little attention, the idea of employing a structured, machine
readable representation of these policies is not new. While we use ODRL here,
other rights representation languages exist, including for example CC-Rel (the
Creative Commons Rights Expression Language11), focusing on open licenses.
It is actually the case that such policies could be represented using more or
less any kind of rule language, including for example AIR [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which includes
the representation of policies as a core use case. ODRL has the advantage to
be dedicated to the explicit representation of digital rights, which are the most
related to the variety of data licenses of interest here.
        </p>
        <p>
          Naturally, there has been many other uses of ODRL, as an ontology to
represent rights and usage policies, in di erent contexts [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], for the enforcement
of access rights [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] or to check the compatibilities of the licenses associated with
web data [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
10 http://www.dublinked.com/
11 https://wiki.creativecommons.org/CC_REL
        </p>
      </sec>
      <sec id="sec-7-2">
        <title>Conclusion</title>
        <p>In this paper, we presented the approach taken by the MK:Smart project to deal
with the diversity of the thousands of datasets to be included in the common
data infrastructures of a smart city (the MK Datahub). We show in particular
how relying on an ontology-based, semantic and structured representation of the
description of each dataset, including the associated policies and the permission,
prohibition and duty rules attached to them can help in managing and curating
these diverse datasets in a formal and robust way, while reducing the e ort
required in keeping track of all the di erent situations.</p>
        <p>While what we are proposing here is still preliminary, and will have to prove
its value as the project evolves to include more and more datasets, it is engrained
in the broader, timely challenges faced by many similar initiatives: Dealing with
more and more data, from more and more sources, and with the di culty of
facing increasingly diverse situations regarding regulation, rights, terms and policies
in data curation. In other terms, from the point of view of innovation, while this
growth in the accessibility of data creates great new opportunities, it is also
creating increasing complexity in the necessary process to curate these data.
While currently these issues are too often overlooked, the area of smart cities
will have to address them, making the need for structured, semantic
representations more prominent not only for the data themselves, but also for all other
aspects associated with them.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Keith</given-names>
            <surname>Alexander</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Hausenblas</surname>
          </string-name>
          .
          <article-title>Describing linked datasets-on the design and usage of void, thevocabulary of interlinked datasets</article-title>
          .
          <source>In In Linked Data on the Web Workshop (LDOW 09)</source>
          ,
          <source>in conjunction with 18th International World Wide Web Conference (WWW 09. Citeseer</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Tuba Bak c , Esteve Almirall, and
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Wareham</surname>
          </string-name>
          .
          <article-title>A smart city initiative: the case of barcelona</article-title>
          .
          <source>Journal of the Knowledge Economy</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ):
          <volume>135</volume>
          {
          <fpage>148</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Lorrie</given-names>
            <surname>Cranor</surname>
          </string-name>
          .
          <article-title>Web privacy with P3P. "</article-title>
          <string-name>
            <surname>O'Reilly Media</surname>
          </string-name>
          ,
          <source>Inc."</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Enrico</given-names>
            <surname>Daga</surname>
          </string-name>
          , Mathieu d'Aquin,
          <string-name>
            <given-names>Aldo</given-names>
            <surname>Gangemi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Enrico</given-names>
            <surname>Motta</surname>
          </string-name>
          .
          <article-title>From datasets to datanodes: An ontology pattern for networks of data artifacts. Semantic Web Journal (submitted), http: // semantic-web-journal. net/ content/ datasets-datanodes-ontology-pattern-networks-data-</article-title>
          <string-name>
            <surname>artifacts</surname>
          </string-name>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Garc</surname>
          </string-name>
          <string-name>
            <surname>a</surname>
          </string-name>
          , Rosa Gil, Isabel Gallego, and
          <string-name>
            <given-names>Jaime</given-names>
            <surname>Delgado</surname>
          </string-name>
          .
          <article-title>Formalising odrl semantics using web ontologies</article-title>
          .
          <source>In Proc. 2nd Intl. ODRL Workshop</source>
          , pages
          <volume>1</volume>
          {
          <fpage>10</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Susanne</given-names>
            <surname>Guth</surname>
          </string-name>
          , Gustaf Neumann, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Strembeck</surname>
          </string-name>
          .
          <article-title>Experiences with the enforcement of access rights extracted from odrl-based digital contracts</article-title>
          .
          <source>In Proceedings of the 3rd ACM workshop on Digital rights management</source>
          , pages
          <volume>90</volume>
          {
          <fpage>102</fpage>
          . ACM,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Renato</given-names>
            <surname>Iannella</surname>
          </string-name>
          .
          <article-title>Open digital rights language (odrl) version 1.1</article-title>
          . W3c Note,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Ankesh</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          , Jie Bao, Lalana Kagal, Ian Jacobi,
          <string-name>
            <given-names>Li</given-names>
            <surname>Ding</surname>
          </string-name>
          , and
          <string-name>
            <given-names>James</given-names>
            <surname>Hendler</surname>
          </string-name>
          .
          <article-title>Analyzing the air language: a semantic web (production) rule language</article-title>
          .
          <source>In Web Reasoning and Rule Systems</source>
          , pages
          <fpage>58</fpage>
          {
          <fpage>72</fpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Lebo</surname>
          </string-name>
          , Satya Sahoo,
          <string-name>
            <surname>Deborah</surname>
            <given-names>McGuinness</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Khalid</given-names>
            <surname>Belhajjame</surname>
          </string-name>
          , James Cheney, David Corsar, Daniel Garijo, Stian Soiland-Reyes,
          <string-name>
            <given-names>Stephan</given-names>
            <surname>Zednik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Prov-o: The prov ontology</article-title>
          .
          <source>W3C Recommendation, 30th April</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Freddy</surname>
            <given-names>Lecue</given-names>
          </string-name>
          , Spyros Kotoulas, and
          <article-title>Pol Mac Aonghusa. Capturing the pulse of cities: A robust stream data reasoning approach</article-title>
          . Position paper,
          <source>IBM Research</source>
          , Smarter Cities Technology Centre, Dublin, Ireland,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Vanessa</surname>
            <given-names>Lopez</given-names>
          </string-name>
          , Spyros Kotoulas, Marco Luca Sbodio, Martin Stephenson, Raymond Lloyd,
          <article-title>Aris Gkoulalas-Divanis, and Pol Mac Aonghusa</article-title>
          .
          <article-title>Queriocity: Accessing the information of a city</article-title>
          .
          <source>In Workshops at the Twenty-Sixth AAAI Conference on Arti cial Intelligence</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <article-title>Taewoo Nam and Theresa A Pardo. Conceptualizing smart city with dimensions of technology, people, and institutions</article-title>
          .
          <source>In Proceedings of the 12th Annual International Digital Government Research Conference: Digital Government Innovation in Challenging Times</source>
          , pages
          <volume>282</volume>
          {
          <fpage>291</fpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>British</given-names>
            <surname>Parliament</surname>
          </string-name>
          .
          <source>Data protection act of</source>
          <year>1998</year>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Carlos Serra~o, Miguel
          <string-name>
            <surname>Dias</surname>
            , and
            <given-names>Jaime</given-names>
          </string-name>
          <string-name>
            <surname>Delgado</surname>
          </string-name>
          .
          <article-title>Using odrl to express rights for different content usage scenarios</article-title>
          .
          <source>In ODRL2005{2nd International ODRL Workshop</source>
          , pages
          <fpage>7</fpage>
          <issue>{8</issue>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Simone</surname>
          </string-name>
          Tallevi-Diotallevi, Spyros Kotoulas, Luca Foschini, Freddy Lecue, and Antonio Corradi.
          <article-title>Real-time urban monitoring in dublin using semantic and stream technologies</article-title>
          .
          <source>In The Semantic Web{ISWC</source>
          <year>2013</year>
          , pages
          <fpage>178</fpage>
          {
          <fpage>194</fpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Serena</given-names>
            <surname>Villata</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fabien</given-names>
            <surname>Gandon</surname>
          </string-name>
          .
          <article-title>Licenses compatibility and composition in the web of data</article-title>
          .
          <source>In Proceedings of the Consuming Linked Data workshop, COLD</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Stuart</surname>
            <given-names>Weibel</given-names>
          </string-name>
          , John Kunze, Carl Lagoze, and
          <string-name>
            <given-names>Misha</given-names>
            <surname>Wolf</surname>
          </string-name>
          .
          <article-title>Dublin core metadata for resource discovery</article-title>
          .
          <source>Internet Engineering Task Force RFC</source>
          ,
          <volume>2413</volume>
          (
          <issue>222</issue>
          ):
          <fpage>132</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>