<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lifting Data Portals to the Web of Data</article-title>
        <subtitle>Or: Linked Open Data (Portals) - Now for real!</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Neumaier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jürgen Umbrich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel Polleres</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vienna University of Economics and Business</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data portals are central hubs for freely available (governmental) datasets. These portals use different software frameworks to publish their data, and the metadata descriptions of these datasets come in different schemas according to the framework. The present work aims at re-exposing and connecting the metadata descriptions of currently 854k datasets on 261 data portals to the Web of Linked Data by mapping and publishing their homogenized metadata in standard vocabularies such as DCAT and Schema.org. Additionally, we publish existing quality information about the datasets and further enrich their descriptions by automatically generated metadata for CSV resources. In order to make all this information traceable and trustworthy, we annotate the generated data using the W3C's provenance vocabulary. The dataset descriptions are harvested weekly and we offer access to the archived data by providing APIs compliant with the Memento framework. All this data, a total of about 120 million triples per weekly snapshot, is queryable at the SPARQL endpoint at data.wu.ac.at/portalwatch/sparql.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Open data portals, such as Austria’s data.gv.at or UK’s data.gov.uk, are central points of access for freely available (governmental) datasets. Government departments publish all kinds of data on these portals, for instance, about the economy, demography, public spending, etc., which improves government transparency, accountability, and public participation. These data catalogs mostly use software frameworks, such as CKAN1 or Socrata2, to publish their data. However, the metadata descriptions for the published data on these portals are only partially available as Linked Data. This work aims at exposing and connecting the datasets’ metadata of various portals to the Web of Linked Data by mapping and publishing their metadata in standard vocabularies such as DCAT and Schema.org. Providing the datasets in such a way enables “to link this structured metadata with information describing locations, scientific publications, or even [knowledge graphs], facilitating data discovery for others.”3</p>
      <p>CKAN and Socrata use their own metadata schemas for describing the published datasets. Further, these data portal frameworks allow the portal providers to extend their schemas with additional metadata keys. This potentially leads to diverse and heterogeneous metadata across different data catalogs. Also, CKAN’s and Socrata’s metadata schemas contain no links to external knowledge and existing ontologies, and no links to other datasets and other data catalogs. This lack of external references implies the risk of producing data silos instead of connected and interlinked data portals.</p>
      <p>
        The W3C identified the issue of heterogeneous metadata schemas across data portals and proposed an RDF vocabulary to solve it: the metadata standard DCAT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (Data Catalog Vocabulary) describes data catalogs and corresponding datasets. It models the datasets and their distributions (published data in different formats) and re-uses various existing vocabularies such as Dublin Core terms and the SKOS vocabulary.
      </p>
      <p>However, currently only a limited number of (governmental) open data portals use the DCAT standard for describing their metadata. Further, DCAT is not always directly applicable to the specific schema extensions and dataset publishing practices deployed by particular data portals. For instance, while DCAT describes the distributions of datasets as different downloadable representations of the same content (i.e., the dataset in different file formats), we observe in CKAN data portals various different aggregations of datasets: a dataset might be grouped by dimensions such as spatial divisions (e.g., data for districts of cities) or by temporal aspects (e.g., the same content for different years). This means that for certain datasets a mapping to the DCAT vocabulary is not straightforwardly possible and some extensions might be needed to accommodate such common practices.</p>
      <p>While mapping data portals’ metadata to DCAT is certainly a step towards a better interlinking of open datasets, the major search engines do not integrate this information for enriching their search results. To enable an integration of data catalogs into the knowledge graphs of search engines (such as Google’s Knowledge Vault4) we further map and publish the DCAT metadata descriptions using Schema.org’s Dataset vocabulary.5</p>
      <p>
        In order to tackle the aim of better integration of datasets on several fronts, we do not only want to expose the metadata descriptions as Linked Data in a homogenized representation, but also improve the descriptions of the actual data. Since CSV is the predominant format on open data portals [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we make use of the W3C’s metadata vocabulary for describing CSVs on the Web.6 This recommendation gives the data publisher a standardized way of describing the dialect, columns, datatypes, etc. of tabular data.
      </p>
      <p>Figure 1 displays the different components and the integration of the overall system. Initially, our framework collects the dataset descriptions (as JSON documents) and maps them to a common representation, the DCAT vocabulary. Then it computes for each dataset a quality assessment and (in case of CSV resources) additional CSV metadata. We model this additional information using W3C vocabularies (DQV and CSVW) and connect the data to the DCAT representations. Since all this data is automatically generated by our framework, we also add provenance information to the dataset descriptions, the quality measurements, and the CSV metadata. 4 https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html 5 https://schema.org/Dataset 6 https://www.w3.org/2013/csvw/</p>
      <p>Overall, in this paper we make the following concrete
contributions:</p>
      <p>A system that re-exposes data extracted from open data portal APIs such as CKAN. The output formats include a subset of W3C’s DCAT with extensions and Schema.org’s Dataset-oriented vocabulary (Section 3).</p>
      <p>An analysis of the exposed metadata reporting which metadata descriptions can be straightforwardly mapped (and which cannot). We illustrate issues and challenges when mapping CKAN metadata to DCAT, and highlight potential design issues/improvements in the target vocabularies (i.e., W3C’s DCAT and Schema.org’s Dataset vocabulary), see Section 3.1.</p>
      <p>We enrich the integrated metadata by the quality measurements of the Portal Watch framework, available as RDF data using the Data Quality Vocabulary (Section 4).</p>
      <p>We present heuristics to further enrich descriptions of tabular data semi-automatically by auto-generating additional metadata using the vocabulary defined by the W3C CSV on the Web working group, which we likewise add to our enriched metadata. Additionally, as a by-product, a user interface to generate and refine such CSV metadata is available (Section 5).</p>
      <p>We use the PROV ontology to record and annotate the provenance of our generated/published data (which is partially generated by using heuristics). The details are described in Section 6.</p>
      <p>All the integrated, enriched and versioned metadata is publicly available as Linked Data at http://data.wu.ac.at/portalwatch/. Additionally, we provide a SPARQL endpoint to query the generated RDF, described in Section 7.1.</p>
      <p>
        Finally, we enable historic access to the original and mapped
dataset descriptions using the Memento framework [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
cf. Section 7.2.
      </p>
      <p>We conclude and give an outlook on future work in Section 8.</p>
    </sec>
    <sec id="sec-2">
      <title>BACKGROUND &amp; RELATED WORK</title>
      <p>e recent DCAT application prole for data portals in Europe
(DCAT-AP)7 extends the DCAT core vocabulary and aims towards
the integration of datasets from dierent European data portals.
In its current version (v1.1) it extends the existing DCAT schema
by a set of additional properties. DCAT-AP allows to specify the
version and the period of time of a dataset. Further, it classies
certain predicates as “optional”, “recommended” or “mandatory”. For
instance, in DCAT-AP it is mandatory for a dcat:Distribution
to hold a dcat:accessURL.</p>
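      <p>Such obligation levels lend themselves to simple completeness checks. The following is a minimal sketch in Python; the property lists are illustrative and do not reproduce the full DCAT-AP profile:</p>

```python
# Sketch: checking DCAT-AP-style obligation levels on a distribution
# description. The property lists below are illustrative, not the
# complete DCAT-AP specification.
MANDATORY = ["dcat:accessURL"]
RECOMMENDED = ["dct:description", "dct:format", "dct:license"]

def check_distribution(dist: dict) -> dict:
    """Return which mandatory/recommended properties are missing."""
    return {
        "missing_mandatory": [p for p in MANDATORY if p not in dist],
        "missing_recommended": [p for p in RECOMMENDED if p not in dist],
    }

dist = {"dcat:accessURL": "http://example.com/data.csv", "dct:format": "CSV"}
report = check_distribution(dist)
# A distribution lacking dcat:accessURL violates the mandatory constraint.
```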
      <p>
        An earlier approach, from 2011, is the VoID vocabulary [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] published by the W3C as an Interest Group Note. VoID, the Vocabulary for Interlinked Datasets, is an RDF schema for describing metadata about linked datasets: it has been developed specifically for data in RDF representation and is therefore complementary to the DCAT model and not fully suitable for modelling metadata on Open Data portals (which usually host resources in various formats) in general.
      </p>
      <p>
        In 2011 Fürber and Hepp [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposed an ontology for data quality management that allows the formulation of data quality and cleansing rules, a classification of data quality problems, and the computation of data quality scores. The classes and properties of this ontology include concrete data quality dimensions (e.g., completeness and accuracy) and concrete data cleansing rules (such as whitespace removal), providing a total of about 50 classes and 50 properties. The ontology allows a detailed modelling of data quality management systems, and might be partially applicable and useful in our system and to our data. However, we decided to follow the W3C Data on the Web Best Practices and use the more lightweight Data Quality Vocabulary for describing the quality assessment dimensions and steps.
      </p>
      <p>
        More recently, in 2015, Assaf et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed HDL, a harmonized dataset model. HDL is mainly based on a set of frequent CKAN keys. On this basis, the authors define mappings from other metadata schemas, including Socrata, DCAT and Schema.org.
      </p>
      <p>
        Portal Watch. The contributions of this paper are based on and build upon the Open Data Portal Watch project [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Portal Watch is a framework for monitoring and quality assessment of (governmental) Open Data portals, see http://data.wu.ac.at/portalwatch. It monitors data from portals using the CKAN, Socrata, and OpenDataSoft software, as well as portals providing their metadata in the DCAT RDF vocabulary.
      </p>
      <p>7hps://joinup.ec.europa.eu/asset/dcat application prole/description</p>
      <p>
        Currently, as of the second week of 2017, the framework monitors 261 portals, which describe in total about 854k datasets with more than 2 million distributions, i.e., URLs (cf. Table 1). As we monitor and crawl the metadata of these portals in a weekly fashion, we can use the gathered insights in two ways to enrich the integrated metadata of these portals: namely, (i) we publish and serve the integrated metadata descriptions in a weekly, versioned manner, and (ii) we enrich these metadata descriptions by the quality assessment metrics defined in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>GENERATING DCAT AND SCHEMA.ORG</title>
      <p>e DCAT model suggests three main classes: dcat:Catalog, dcat:
Dataset and dcat:Distribution. e denition of a dcat:Catalog
corresponds to the concept of data portals, i.e., it describes a
webbased data catalog and holds a collection of datasets (using the
dcat:dataset property). An instance of the dcat:Dataset class
describes a metadata instance which can hold one or more
distributions, a publisher, and a set of properties describing the dataset. A
dcat:Distribution instance provides the actual references to the
resource (using dcat:accessURL or dcat:downloadURL). Further,
it contains properties to describe license information (dct:license),
format (dct:format) and media-type (dct:mediaType)
descriptions and general descriptive information (e.g, dct:title and
dcat:byteSize).</p>
      <p>
        The Portal Watch framework maps the harvested metadata descriptions from CKAN, Socrata and OpenDataSoft portals to the DCAT vocabulary (as defined and described in detail in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). For instance, the values of the CKAN metadata fields title, notes, or tags get mapped to DCAT using the properties dct:title, dct:description, and dcat:keyword, which are associated with a certain dataset instance.8 For CKAN metadata, this mapping is mainly based on an existing CKAN extension to export RDF serializations of CKAN datasets based on DCAT.9 This extension is maintained by the Open Knowledge Foundation and its source code is also in use in the Portal Watch framework (in a slightly adapted form).
      </p>
      <p>As a next step, we generate Schema.org-compliant dataset descriptions by mapping the DCAT descriptions to Schema.org’s dataset vocabulary. This mapping is implemented based on a W3C working draft.10 The three main DCAT classes Catalog, Dataset, and Distribution are mapped to the Schema.org classes DataCatalog, Dataset, and DataDownload (cf. Table 2). The mapping defined in the W3C draft covers all core properties specified in the DCAT recommendation except for dcat:identifier and dcat:frequency. 8 dct: is the Dublin Core Terms namespace 9 https://github.com/ckan/ckanext-dcat 10 https://www.w3.org/2015/spatial/wiki/ISO_19115_-_DCAT_-_Schema.org_mapping, last accessed 2016-12-26</p>
      <sec id="sec-3-1">
        <title>DCAT</title>
        <p>dcat:Catalog
dcat:Dataset
dcat:Distribution
dcat:identifier
dcat:frequency</p>
      </sec>
      <sec id="sec-3-2">
        <title>Schema.org</title>
        <p>schema:DataCatalog
schema:Dataset
schema:DataDownload
?
?</p>
        <p>All mapped dataset descriptions for all weekly harvested versions
(hereaer referred to as snapshots) are accessible via the Portal
Watch API (hp://data.wu.ac.at/portalwatch/api/v1):
/portal/fportalidg/fsnapshotg/dataset/fdatasetidg/dcat
is interface returns the DCAT description in JSON-LD
for a specic dataset. e parameter portalid species
the data portal and datasetid the dataset. e parameter
snapshot allows to select archived versions of the dataset:
the parameter has to be provided as yyww specifying the
year and week of the dataset, e.g., 1650 for week 50 in 2016.
/portal/fportalidg/fsnapshotg/dataset/fdatasetidg/schemadotorg
Analogous to the above API call, this interface returns the
Schema.org mapping for a dataset as JSON-LD, using the
same parameters.</p>
        <p>Additionally, we publish the Schema.org dataset descriptions as
single, crawl-able, web pages, listed at hp://data.wu.ac.at/odso
(and hp://data.wu.ac.at/odso/sitemap.xml as access point for search
engines, respectively). ese Schema.org metadata descriptions are
embedded within the HTML pages, following the W3C JSON-LD
recommendation.11
3.1</p>
      </sec>
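      <p>The yyww snapshot parameter can be computed from a date using Python's ISO-calendar support; a minimal sketch of constructing such an API URL (the helper names are ours):</p>

```python
import datetime

API = "http://data.wu.ac.at/portalwatch/api/v1"

def snapshot_param(date: datetime.date) -> str:
    """Encode a date as the yyww snapshot parameter (ISO year and week)."""
    iso_year, iso_week, _ = date.isocalendar()
    return f"{iso_year % 100:02d}{iso_week:02d}"

def dcat_url(portalid: str, datasetid: str, date: datetime.date) -> str:
    """Build the DCAT endpoint URL for one dataset at one snapshot."""
    return f"{API}/portal/{portalid}/{snapshot_param(date)}/dataset/{datasetid}/dcat"

# A date in ISO week 50 of 2016 yields the snapshot parameter "1650",
# matching the example in the text.
```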
    </sec>
    <sec id="sec-4">
      <title>Challenges and mapping issues</title>
      <p>3.1.1 DCAT mapping available on CKAN portals. The aforementioned CKAN-to-DCAT extension defines mappings for CKAN datasets and their resources to the corresponding DCAT classes dcat:Dataset and dcat:Distribution and offers them via the CKAN API. However, in general it cannot be assumed that this extension is deployed on all CKAN portals: we were able to retrieve the DCAT descriptions of datasets for 93 of the 149 active CKAN portals monitored by Portal Watch.</p>
      <p>e CKAN soware allows portal providers to include additional
metadata elds in the metadata schema. When retrieving the
metadata description for a dataset via the CKAN API, these keys are
included in the resulting JSON under the metadata elds “extras”.
However, it is not guaranteed that the DCAT conversion of the
CKAN metadata contains these extra elds. Depending on the
version and the conguration of the export-extension there are three
dierent cases:
Predened mapping: In recent versions of the extension the portal
provider can dene a mapping for certain CKAN elds
to a specic RDF property. For instance, a CKAN extra
eld contact-email (which is not by default part of the
11hps://www.w3.org/TR/json-ld-syntax/#embedding-json-ld-in-html-documents
CKAN schema and is not dened in the extension’s
mapping) could be mapped to an RDF graph using the property
vcard:hasEmail from the vCard ontology, e.g.:
&lt; http :// e x a m p l e . com / example - dataset &gt;
dcat : c o n t a c t P o i n t [</p>
      <p>vc ard : h a s E m a i l " e x a m p l e @ e m a i l . com "
] ;
Default mapping: A paern for exporting all available extra
metadata keys which can be observed in several data portals
is the use of the dct:relation (Dublin Core vocabulary)
property to describe just the label and value of the extra
keys, e.g.:
&lt; http :// e x a m p l e . com / example - dataset &gt;
dct : r e l a t i o n [
rdfs : l abe l " contact - e mai l " ;
rdf : v alu e " e x a m p l e @ e m a i l . com "
] ;
No mapping: e retrieved DCAT description returns no mapping
of these keys and the information is therefore not available.
In order to avoid these dierent representations (and potentially
missing information) of extra metadata elds, we do not harvest
the DCAT mappings of the CKAN portals but rather the original,
complete, JSON metadata description available via the CKAN API
and apply a (rened) mapping to DCAT at our framework.</p>
      <p>3.1.2 Use of CKAN extra metadata fields. We analysed the metadata of 749k datasets over all 149 CKAN portals and extracted a total of 3746 distinct extra metadata fields. Table 3 lists the most frequently used fields sorted by the number of portals they appear in; the most frequent, spatial, appears in 29 portals. Most of these cross-portal extra keys are generated by widely used CKAN extensions. The keys in Table 3 are all generated by the harvesting12 and spatial extensions.13 We manually selected mappings for the most frequent extra keys if they are not already included in the mapping; the selected properties are listed in the “DCAT key” column in Table 3. For cells marked with ?, we were not able to choose an appropriate DCAT core property.</p>
      <p>Looking in more detail at these 3746 extra keys, we discovered that 1553 unique keys are of the form links:{dataset-id}, e.g., links:air-temperature or links:air-pressure. All these links: keys originate from the datahub.io portal, which provides references to Linked Data as CKAN datasets. The portal uses these keys to encode links between two datasets within the portal. While this information is certainly beneficial, the encoding of links between datasets (i.e., using the metadata key as link) shows the need for expressing relations between datasets in existing data portals.</p>
      <p>3.1.3 Modelling CKAN datasets. The CKAN software allows data providers to add multiple resources to a dataset description. These resources are basically links to the actual data plus some additional corresponding metadata (e.g., format, title, mime-type).</p>
      <p>is concept of resources relates to distributions in DCAT. A
DCAT distribution is dened the following way: “Represents a
specic available form of a dataset. Each dataset might be available
in dierent forms, these forms might represent dierent formats of the
dataset or dierent endpoints. […]”15 is means that distributions of
a dataset should consist of the same data in dierent representations.
We applied the following two heuristics in order to nd out if CKAN
resources are used as distributions, i.e., if CKAN resources represent
the same content in dierent formats:</p>
      <p>Title similarity: We compared the titles of the resources of a dataset using the Ratcliff-Obershelp string similarity implemented in the Python difflib library. In case any two resource titles have a similarity higher than 0.8 (with a maximum similarity of 1) we consider the resources as “distributions”. For instance, two resources with titles “air-temperature.csv” and “air-temperature.json” most likely contain the same data in CSV and JSON format.</p>
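      <p>This heuristic can be reproduced directly with Python's difflib, which implements the Ratcliff-Obershelp measure; a minimal sketch:</p>

```python
from difflib import SequenceMatcher

def similar_titles(a: str, b: str, threshold: float = 0.8) -> bool:
    """Ratcliff-Obershelp similarity, as used for the distribution heuristic."""
    return SequenceMatcher(None, a, b).ratio() > threshold

# Likely the same data in two formats, hence treated as distributions:
similar_titles("air-temperature.csv", "air-temperature.json")  # returns True
```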
      <p>Formats: We looked into the file formats of the resources and report the number of datasets where all formats differ or some formats appear multiple times (e.g., a dataset consisting of two CSVs, which indicates different content in these files).</p>
      <p>Out of the 767k CKAN datasets, about half hold more than one resource (cf. Table 5). Out of the 401k multi-resource datasets, for 140k datasets all corresponding file formats are different, indicating that these are possibly distributions of the dataset.</p>
      <p>14e high number of keys occurring in two portals is potentially due to the fact
that many portals harvest datasets, i.e. the metadata descriptions, of other portals (see
the number of portals using the harvesting extension in Table 3).</p>
      <p>15hps://www.w3.org/TR/vocab-dcat/#class-distribution
Using string similarity we encountered similar titles for at least two
resources in 261k out of the 401k multi-resource datasets.
ese numbers indicate that there is no common agreement on
how to use resources in CKAN. On the one hand there is a high
number of datasets where resources are published as “distributions”
(see all di. le formats and similar titles in Table 5) while on
the other hand the remaining datasets group resources by other
aspects (see multi-appearance); e.g., a dataset consisting of the
resources “air-temperature-2013.csv”, “air-temperature-2014.csv”,
“air-temperature-2015.csv”.
4</p>
    </sec>
    <sec id="sec-5">
      <title>USING THE DATA QUALITY VOCABULARY</title>
      <p>In this section we summarize the quality assessment performed by the Portal Watch framework and detail how these measurements are published and connected to the corresponding datasets.</p>
      <p>Besides the regular crawling and monitoring of data portals, the Portal Watch framework performs a quality assessment along several quality dimensions and metrics. These dimensions and metrics are defined on top of the DCAT vocabulary, which allows us to treat and assess the content independently of the portal’s software and metadata schema.</p>
      <p>
        This quality assessment is performed along several dimensions: (i) the existence dimension consists of metrics checking for important information, e.g., if there is contact information in the metadata; (ii) the metrics of the conformance dimension check if the available information adheres to a certain format, e.g., if the contact information is a valid email address; (iii) the open data dimension’s metrics test if the specified format and license information is suitable to classify a dataset as open. The formalization and implementation details can be found in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
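      <p>To illustrate the flavour of such metrics, here is a minimal sketch of an existence and a conformance check; the actual metric definitions are formalized in [10], and the email regex here is a deliberate simplification:</p>

```python
import re

# Illustrative existence/conformance metrics; the real definitions are
# formalized in the Portal Watch paper. The email regex is simplified.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def contact_exists(dataset: dict) -> bool:
    """Existence: is any contact information present in the metadata?"""
    return bool(dataset.get("dcat:contactPoint"))

def contact_conforms(dataset: dict) -> bool:
    """Conformance: is the contact point a syntactically valid email?"""
    return bool(EMAIL_RE.match(dataset.get("dcat:contactPoint", "")))

ds = {"dcat:contactPoint": "example@email.com"}
```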
      <p>e W3C’s Data ality Vocabulary16 (DQV) is intended to
be an extension to the DCAT vocabulary. It provides classes to
describe quality dimensions, metrics and measurements, and
corresponding properties. We use DQV to make the quality measures
of the Portal Watch framework as RDF available and to link the
assessment to the dataset descriptions. Figure 2 displays an
example quality assessment modelled in the DQV. e italic descriptions
(e.g., dqv:alityMeasurement and dqv:Metric) denote the classes
of the entities, i.e., the a-relations. e measurements of a dataset
are described by using a blank node (cf. :bn) and the dqv:value
property to assign quality measures to the datasets.</p>
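      <p>A sketch of emitting such a measurement in Turtle, following the blank-node pattern of Figure 2; the prefixed names (ex:, pw:) are illustrative:</p>

```python
# Emit a DQV quality measurement in Turtle: a blank node typed
# dqv:QualityMeasurement, linked to its metric and value. The dataset
# and metric names used below are illustrative.
def dqv_measurement(dataset: str, metric: str, value: float) -> str:
    return (
        f"{dataset} dqv:hasQualityMeasurement [\n"
        f"    a dqv:QualityMeasurement ;\n"
        f"    dqv:isMeasurementOf {metric} ;\n"
        f"    dqv:value {value}\n"
        f"] ."
    )

turtle = dqv_measurement("ex:dataset1", "pw:ContactEmail", 1.0)
```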
      <p>API access to the measurements. The DQV results can be retrieved using the following API or by querying the SPARQL endpoint (see Section 7.1): /portal/{portalid}/{snapshot}/dataset/{datasetid}/dqv Analogous to the previous APIs (see Section 3), this interface returns the DQV results in JSON-LD for a specific dataset, requiring the parameters portalid, datasetid and snapshot (specifying the year and week of the dataset). 16 https://www.w3.org/TR/vocab-dqv/</p>
    </sec>
    <sec id="sec-5a">
      <title>USING W3C'S CSVW METADATA</title>
      <p>
        The W3C’s CSV on the Web Working Group17 (CSVW) proposed a
metadata vocabulary that describes how CSV data (comma-separated-value files or similar tabular data sources) should be interpreted
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The vocabulary includes properties such as primary and foreign keys, datatypes, column labels, and CSV dialect descriptions (e.g., delimiter, quotation character and encoding).
      </p>
      <p>We use this W3C vocabulary to further describe the CSV resources in our corpus of data portals. Therefore, we filter all resource URLs which use CSV as their file format in the dataset description. We try to retrieve the first 100 lines of each of these CSVs and apply the following methods and heuristics to determine the dialect and properties of the CSVs:</p>
      <p>We use the “Content-Type” and “Content-Length” HTTP response header fields to get the media type and file size of the resource. Note that both of these fields might contain inaccurate information in some cases, since some servers send the content length of the compressed resource and also use the compression’s media type (e.g., application/gzip).</p>
      <p>We use the Python magic package18 to detect the file encoding of the retrieved resource.</p>
      <p>We slightly modied the default Python CSV parser by
including the encoding detection and rening the
delimiter detection (by increasing the number snied lines and
modifying the preferred delimiters); the Python module is
online available.19
We heuristically determine the number of header lines in
a CSV le by considering changes to datatypes within the
rst rows. For instance, if we observe columns where all
entries are numerical values and follow the same paern –
including the rst row – we do not consider the rst row
as a leading header row.20
We perform a simple datatype detection on the columns of
the CSVs: we distinguish between columns which contain
numerical, binary, datetime or any other “string” values,
and use the respective XSD datatypes21 number, binary,
datetime and anyAtomicType.</p>
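      <p>The header heuristic can be sketched as follows; this is a strongly simplified version of the described datatype-change check, restricted to numeric columns:</p>

```python
# Simplified header-row heuristic: if a column is numeric in every row
# below the first, but the first row's cell is not numeric, the first
# row is probably a header. (A much-reduced form of the real check.)
def is_number(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False

def has_header(rows: list) -> bool:
    """Guess whether the first row of a parsed CSV is a header row."""
    first, rest = rows[0], rows[1:]
    if not rest:
        return False
    for col in range(len(first)):
        if all(is_number(r[col]) for r in rest) and not is_number(first[col]):
            return True  # datatype changes after the first row
    return False
```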
      <p>
        This acquired information is then used to generate RDF compliant with the CSVW metadata vocabulary [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Figure 3 displays an example graph for a CSV resource. The blank node :csv represents the CSV resource which can be downloaded at the URL given by the property csvw:url. The values of the properties dcat:byteSize and dcat:mediaType are the values of the corresponding HTTP header fields. The dialect description of the CSV can be found via the blank node :dialect at the property csvw:dialect, and the columns of the CSV are connected to the :schema blank node (describing the csvw:tableSchema of the CSV). For some files our parser was not able to detect/guess the delimiter of the CSV table; the remaining download URLs were either malformed, or resulted in connection timeouts and server errors.
      </p>
      <p>In future work we want to increase this number of analysed CSVs. There are CSV files with missing or wrong format descriptions, which could be detected by using the file extensions and the HTTP media type of the resources.</p>
      <sec id="sec-5-1">
        <title>Assisted generation of CSVW metadata</title>
        <p>The Portal Watch includes, for each portal’s CSV files, a link to a pre-filled UI form. This form allows further describing the name, datatype and properties of each detected column, and is pre-filled with the following detected dialect description fields: commentPrefix (is there a prefix for leading comment lines), doubleQuote (use of "" as escape character for quotation), delimiter, encoding, header, headerRowCount (number of header rows), lineTerminators, quoteChar (the character used for quotation).</p>
        <p>The generated CSVW metadata can be downloaded as JSON-LD in order to publish it along with the corresponding CSV file. Figure 4 displays the form: “table description” provides general descriptions such as language, title, and publisher; “column description” provides properties for each column separately; “dialect description” allows the description and modification of the detected CSV dialect. For our latest snapshot (third week of 2017) there were a total of 222k URLs with “CSV” in the dataset’s metadata description. Out of these, we successfully parsed and generated the CSVW metadata for 153k files. For 44k files we were not able to parse the file and read the first lines. Possible reasons are that the files are not in the described format (e.g., compressed) or that our parser failed to detect the dialect. 19 https://github.com/sebneu/anycsv 20 Obviously, there are cases where this heuristic may fail. Our intention here is that this “guessed” information might already be of value for a user.</p>
        <p>21 http://www.w3.org/2001/XMLSchema</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>ADDING PROVENANCE ANNOTATIONS</title>
      <p>Apart from generating mappings, quality measurements and enrichments of the metadata alone, in order to make the data traceable and allow users to judge its trustworthiness, it is important to record the provenance of our generated/published data. There are several approaches to address this issue for RDF. A lightweight approach could use different Dublin Core properties to refer from a dataset to the entities/agents (i.e., our system) which published the resources, e.g., by using properties such as dc:publisher. However, the DCAT metadata descriptions already use these Dublin Core properties and therefore such additional annotations would interfere with the existing dataset descriptions.</p>
      <p>
        The PROV ontology [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a more flexible approach which provides an ontology to annotate all kinds of resources with provenance information and allows tracking the provenance of resource representations. On a high level, PROV distinguishes between entities, agents, and activities. Entities can be all kinds of things, digital or not, which are created or modified. Activities are the processes which create or modify entities. An agent is something or someone who is responsible for an activity (and indirectly also for an entity). Additionally, PROV allows tagging certain activities with time, for example a timestamp when an entity was created.
      </p>
      <p>To add provenance information to our generated RDF data we define a prov:SoftwareAgent (a subclass of prov:Agent) with the URI &lt;http://data.wu.ac.at/portalwatch&gt;, cf. Figure 5. Since our Portal Watch framework generates weekly snapshots of portals, i.e., weekly versions of the datasets of a data portal, and also assesses the quality of these fetched datasets, we associate such a snapshot with a prov:Activity which generated the DCAT representation of the dataset and the respective quality measurements. The fact that the measurements were computed on the DCAT dataset descriptions is modelled using the prov:wasDerivedFrom property.</p>
      <p>Regarding the (heuristically) generated CSVW metadata, we
annotate all :csv resources (cf. Section 5) as prov:Entity and
associate them with a prov:Activity with URI &lt;http://data.wu.ac.
at/portalwatch/csvw/{snapshot}&gt; for a corresponding snapshot.
These activities represent the weekly performed metadata/dialect
extraction on the CSVs. Additionally, we add the triple :csv
prov:wasDerivedFrom CSV-url to indicate that the CSVW-metadata
entities were constructed based on the existing CSV resources.</p>
    </sec>
    <sec id="sec-6a">
      <title>DATA ACCESS &amp; CLIENT INTERFACES</title>
      <p>This section describes how the generated RDF data is connected and
how we enable access to this data. In the previous sections we
described four different datasets: (i) the homogenized representation
of metadata descriptions (using the DCAT vocabulary), (ii) quality
measurements of these descriptions along several dimensions, (iii)
additional table schema and dialect descriptions for CSV resources,
and (iv) provenance information for the generated RDF data.</p>
      <p>In the example graph in Figure 6, bold edges and bold nodes
represent the properties and resources which connect these four
generated datasets. The corresponding classes for the main entities
are depicted using dashed nodes.</p>
      <p>In the following we introduce the public SPARQL endpoint for
querying the generated data and the implemented Memento APIs
which provide access to the archived datasets by using datetime
negotiation.</p>
    </sec>
    <sec id="sec-7">
      <title>SPARQL endpoint</title>
      <p>We make the mapped DCAT metadata descriptions and their
respective quality assessments available via a SPARQL endpoint located
at the Portal Watch framework (http://data.wu.ac.at/portalwatch/
sparql). Currently, we have loaded three snapshots of the generated
data into the RDF triple store (weeks 2, 3, and 4 of 2017), where each
snapshot is published as a named graph. These snapshots consist of
about 120 million triples each. However, the numbers vary
because we observe server errors for certain portals and therefore
are not able to harvest the same number of dataset descriptions
every week. The underlying system is OpenLink Virtuoso.22</p>
      <p>In order to describe the quality metrics and dimensions of the
Portal Watch framework we define URLs which refer to the
respective definitions (using the pwq namespace). Additionally, the
endpoint re-uses the namespaces as displayed in Listing 1.</p>
      <p>PREFIX dcat: &lt;http://www.w3.org/ns/dcat#&gt;
PREFIX dqv: &lt;http://www.w3.org/ns/dqv#&gt;
PREFIX prov: &lt;http://www.w3.org/ns/prov#&gt;
PREFIX csvw: &lt;http://www.w3.org/ns/csvw#&gt;
PREFIX pwq: &lt;http://data.wu.ac.at/portalwatch/quality#&gt;</p>
      <sec id="sec-7-1">
        <title>Listing 1: Used namespaces</title>
        <p>7.1.1 Exploring datasets. The SPARQL endpoint allows users to
explore and search datasets across data portals and find common
descriptions and categories.</p>
        <p>For instance, the query in Listing 2 returns all portals in the
Portal Watch system which use transportation as a keyword/tag (in
total 31 portals).</p>
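Listing 2 itself is not reproduced here; under the DCAT mapping, a query of this kind can be sketched roughly as follows (the exact graph structure, e.g., whether datasets are attached to their portal via dcat:dataset, is an assumption for illustration):

```sparql
PREFIX dcat: <http://www.w3.org/ns/dcat#>

SELECT DISTINCT ?portal WHERE {
  ?portal dcat:dataset ?dataset .
  ?dataset dcat:keyword "transportation" .
}
```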
        <p>
          7.1.2 Metadata quality comparison and aggregation. The SPARQL
endpoint also allows users to compare and filter datasets across different
portals, and to aggregate quality metrics on different levels.
        </p>
        <p>
          In order to enable standardized access to the harvested and
archived dataset descriptions of the Portal Watch framework we
use the HTTP-based Memento framework [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. We implemented
pattern 2 of the specification, “A Remote Resource Acts as a TimeGate
for the Original Resource”, which we detail in the following.
Original Resource (URI-R): The Original Resource is a link to the
resource for which our framework provides prior states.
In our implementation this URI-R is the landing page for a
dataset description at a specific data portal. For instance,
the URI-R uri-r23 is an available dataset description at the
Austrian data portal data.gv.at.
        </p>
        <p>Time Gate (URI-G): The TimeGate URI for a URI-R is a resource
provided by our Memento implementation that offers
datetime negotiation in order to support access to the archived
versions of the original resource. The URI-G for a specific
dataset is available at &lt;http://data.wu.ac.at/portalwatch
/api/v1/memento/{portalid}/{datasetid}&gt; using the
internal portal ID and the dataset’s ID; e.g., uri-g24 for the
above dataset.</p>
        <p>Memento: A Memento for a URI-R is a resource which provides a
specified prior state of the original resource. The Memento
for a dataset description is available at</p>
        <p>&lt;http://data.wu.ac.at/portalwatch/api/v1/memento/
{portalid}/{date}/{datasetid}&gt;, where date follows
the pattern YYYY&lt;MM|DD|HH|MM|SS&gt; (the parameters within
&lt; and &gt; are optional). The Memento for a specific given
date is defined as the closest available version after the
given date. For instance, the archived version for the
example dataset uri-r can be accessed at uri-m25; this URI
returns the archived dataset description closest after
January 1, 2017.</p>
        <p>23uri-r:
&lt;https://www.data.gv.at/katalog/dataset/add66f20-d033-4eee-b9a0-47019828e698&gt;</p>
        <p>24uri-g: &lt;http://data.wu.ac.at/portalwatch/api/v1/memento/data gv at/
add66f20-d033-4eee-b9a0-47019828e698&gt;</p>
        <p>25uri-m: &lt;http://data.wu.ac.at/portalwatch/api/v1/memento/data gv at/
20170101/add66f20-d033-4eee-b9a0-47019828e698&gt;</p>
        <p>In our implementation we offer these Mementos (i.e., prior
versions) with explicit URIs in different ways: (i) we provide access
to the original dataset descriptions retrieved from the data
portals’ APIs (e.g., uri-m, which returns the archived JSON metadata
retrieved from a CKAN data portal), (ii) the dataset description
mapped to the DCAT vocabulary (using the suffix /dcat for the
URI-G and Memento resources), or to the Schema.org vocabulary (using
the suffix /schemadotorg), serialized as JSON-LD, and (iii) the
quality assessment results in the DQV vocabulary (using the suffix /dqv),
serialized as JSON-LD.</p>
        <p>
          Datetime negotiation. The Memento framework specifies a
mechanism to access prior versions of Web resources at the level
of HTTP request and response headers. It introduces the
“Accept-Datetime” and “Memento-Datetime” HTTP header fields and
extends the existing “Vary” and “Link” headers [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. In order to
support datetime negotiation within our Memento
implementation we implemented these headers for the available URI-G and
Memento resources.
        </p>
        <p>Our framework implementation follows a 200 negotiation style:
a request to the TimeGate URI of a resource has a “200 OK” HTTP
status code and already returns the requested Memento. To indicate
that our TimeGate URIs are capable of datetime negotiation, the
“Vary” header includes the “accept-datetime” value (cf. Listing 5).
Since the original dataset descriptions, i.e., the URI-Rs, are hosted
by remote servers, we cannot support Memento-compliant HTTP
headers for these resources.</p>
        <p>In order to retrieve an archived version, a request to the TimeGate
of a resource can include the “Accept-Datetime” HTTP header. This
header indicates that the user wants to access a past state of the
resource. If this header is not present, our implementation will
return the most recent version of the resource (i.e., the most recent
archived dataset description). Otherwise, the response to this
request is the closest version of the resource after the transmitted
datetime header value, i.e., the corresponding Memento. For instance, in
Listing 4 a request to uri-g is issued including an “Accept-Datetime”.
HEAD /portalwatch/api/v1/memento/data_gv_at/add66f20-d033-4eee-b9a0-47019828e698 HTTP/1.1
Host: data.wu.ac.at
Accept-Datetime: Sun, 01 Jan 2017 10:00:00 GMT</p>
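Such a negotiation request can be reproduced with, e.g., Python's standard library; the snippet below only constructs the request from Listing 4 without sending it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# TimeGate (uri-g) for the example dataset at data.gv.at
uri_g = ("http://data.wu.ac.at/portalwatch/api/v1/memento/"
         "data_gv_at/add66f20-d033-4eee-b9a0-47019828e698")

# RFC 1123 date string, as required for the Accept-Datetime header
accept_dt = format_datetime(
    datetime(2017, 1, 1, 10, 0, 0, tzinfo=timezone.utc), usegmt=True)

# HEAD request with datetime negotiation, as in Listing 4;
# urllib.request.urlopen(req) would then perform the negotiation.
req = Request(uri_g, method="HEAD",
              headers={"Accept-Datetime": accept_dt})
```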
        <p>Listing 4: Request Datetime Negotiation with uri-g
The response header to such a datetime negotiation request with
the URI-G of a resource includes the “Memento-Datetime” header,
which expresses the archival datetime of the Memento. Further, it
includes the “Content-Location” header, which explicitly directs to
the Memento URI, i.e., to the distinct URI of the archived resource.
The “Link” header contains URI-R with the “original” relation type
(the link to the original dataset description) and URI-G with the
“timegate” relation type. These header fields are also included in all
Memento URIs’ response headers, e.g., also in the header of uri-m.</p>
        <p>Listing 5 shows the HTTP response header to the request to uri-g
in Listing 4. This header includes the crawl time of the archived
dataset in the “Memento-Datetime” header and provides a direct
link to the Memento in the “Content-Location” header. The “Link”
header includes the reference to the original dataset at the data
portal.</p>
        <p>HTTP/1.0 200 OK
Content-Type: application/json
Memento-Datetime: Sun, 25 Dec 2016 23:00:00 GMT
Link: &lt;http://www.data.gv.at/katalog/dataset/add66f20-d033-4eee-b9a0-47019828e698&gt;; rel="original",
&lt;http://data.wu.ac.at/portalwatch/api/v1/memento/data_gv_at/add66f20-d033-4eee-b9a0-47019828e698&gt;; rel="timegate"
Vary: accept-datetime
Content-Location: http://data.wu.ac.at/portalwatch/api/v1/memento/data_gv_at/20161226/add66f20-d033-4eee-b9a0-47019828e698
Content-Length: 11237
Date: Mon, 16 Jan 2017 16:30:21 GMT</p>
      </sec>
      <sec id="sec-7-2">
        <title>Listing 5: Response from uri-g to request of Listing 4</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>CONCLUSION</title>
      <p>In this work we have extended the existing Portal Watch system
such that it re-exposes the dataset descriptions of 261 data portals
as RDF data using the DCAT and Schema.org vocabularies. We
additionally publish quality measurements along several dimensions for
each dataset description, using the W3C’s Data Quality Vocabulary,
and we further enriched the dataset descriptions with automatically
generated metadata for CSV resources such as the column headers,
column datatypes and CSV delimiter. Also, in order to ensure
traceability of the published RDF data, the mapped/generated dataset
descriptions and respective measurements contain provenance
annotations. To allow users access to archived versions of the dataset
descriptions, the Portal Watch framework offers APIs based on the
Memento framework: time-based content negotiation on top of
the HTTP protocol. As a next step for our framework we plan to
address the following issues.</p>
    </sec>
    <sec id="sec-9">
      <title>Future Work</title>
      <p>Automatically generate richer CSVW metadata. We plan to
improve the CSV analysis and generate richer CSVW metadata. For
example, the column datatypes of the CSVW metadata are based
on the XSD datatype definitions. These types are hierarchically
defined (e.g., a positive integer is also an integer, which is also a decimal).
More advanced heuristics can be applied to the values in order to
generate more fine-grained datatypes. For instance, the
specification allows patterns for date(time) columns to be defined, which could
be automatically detected by such a heuristic.</p>
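A datatype heuristic of the kind envisioned here can be illustrated as follows; the rule set is a simplified assumption for illustration, not the deployed heuristic.

```python
import re

def guess_datatype(values):
    """Walk down a (tiny) slice of the XSD hierarchy, most specific
    type first, and additionally detect an ISO date pattern."""
    if not values:
        return "xsd:string"
    if all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) for v in values):
        return "xsd:date"
    if all(re.fullmatch(r"[+-]?\d+", v) for v in values):
        # every positive integer is also an integer, which is a decimal
        if all(int(v) > 0 for v in values):
            return "xsd:positiveInteger"
        return "xsd:integer"
    if all(re.fullmatch(r"[+-]?\d+(\.\d+)?", v) for v in values):
        return "xsd:decimal"
    return "xsd:string"
```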
      <p>
        Complementarily, we want to further improve the assisted
generation of CSVW metadata by combining our dialect and datatype
detection with approaches to (semi-)automatically annotate entities
in CSVs with classes and columns with properties [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>Representing snapshots as historical data. In the Portal Watch
framework a weekly snapshot of the monitored portals is stored
together with the quality assessments. In the triple store, the
generated RDF is then stored for each snapshot as a new named graph.
However, one might be interested in asking queries such as “How
regularly does the metadata of this dataset change?”, “When did
the last change to a certain metadata field occur?”, or “How did the
quality of a dataset evolve over time?”; the current data model is not
sufficient (or not practicable) for such queries.</p>
      <p>We also have to deal with scalability issues considering the
number of triples currently produced. The Portal Watch
framework has been monitoring and archiving (in a relational database) the
metadata descriptions of 250 portals for about one year.
Assuming that the previous snapshots also consist of about 120
million triples per snapshot for the archived versions, we can
very roughly estimate the total number of triples at 6 billion (50
weeks × 120M triples). If we also assume that we want to keep
up our service in the future and that the number of datasets and
portals will further increase, we have to investigate how we can
store the data efficiently while maintaining the services to retrieve
and use the data.</p>
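The back-of-the-envelope estimate above in code form (snapshot size and archive length are the approximations from the text):

```python
# Rough estimate of the total archive size in triples.
weeks_archived = 50                 # ~one year of weekly snapshots
triples_per_snapshot = 120_000_000  # ~120M triples per snapshot
total_triples = weeks_archived * triples_per_snapshot  # 6 billion
```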
      <p>
        There are already several ongoing approaches which try to cope
with these issues: in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Fernández et al. benchmark existing RDF
archiving techniques along several aspects such as storage space
efficiency, retrieval functionality, and performance of various retrieval
operations. The authors identify three main archiving strategies for
RDF: (i) storing independent copies of each version, which corresponds to
our current approach of different named graphs for each snapshot.
To address the scalability issue of this strategy, (ii) change-based
approaches compute and store the deltas between versions.
Alternatively, (iii) in timestamp-based approaches each triple is annotated
with its temporal validity.
      </p>
      <p>
        A recent approach by Fionda et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposes a framework for
querying RDF data over time by extending SPARQL. This extension
inherits temporal operators from Linear Temporal Logic, e.g.,
PREVIOUS, ALWAYS, or EVENTUALLY. A logical and necessary next
step for our metadata archive is to select and implement a suitable
model.
      </p>
      <p>Interlink datasets and connect to external knowledge. The
metadata, as it is currently published at our Portal Watch framework, is
only partially interlinked and there are hardly any links to
external knowledge bases. The reason for this is that the originating portal
frameworks (e.g., CKAN, Socrata) do not provide options to describe
related/associated datasets, to describe the datasets using
external vocabularies, or to add links to classes and properties in
external sources.</p>
      <p>
        In order to add such links and connections we plan to extract
labels, properties and classes from the actual data sources and use
these to enrich the metadata and establish connections between
datasets. There is already an extensive body of research in the
Semantic Web community on deriving such semantic labels which
can be built upon [
        <xref ref-type="bibr" rid="ref1 ref14 ref9">1, 9, 14</xref>
        ].
      </p>
      <p>
        A recent approach by Tygel et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] tries to establish links
between Open Data portals by extracting the tags/keywords of
the dataset descriptions and merging them (using translations and
similarity measures) at a tag server, where they provide unique
URIs for these tags. The tags are further described using relations
such as skos:broader, owl:sameAs and muto:hasMeaning. We
will investigate whether and how we can use this service to connect our
generated RDF data to these tag descriptions.
      </p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGEMENTS</title>
      <p>is work has been supported by the Austrian Research Promotion
Agency (FFG) under the project ADEQUATe (grant no. 849982).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Marco D. Adelfio and Hanan Samet. Schema extraction for tabular data on the web. PVLDB, 6(6):421-432, 2013.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Keith Alexander, Richard Cyganiak, Michael Hausenblas, and Jun Zhao. Describing Linked Datasets with the VoID Vocabulary. https://www.w3.org/TR/void/, March 2011.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Ahmad Assaf, Raphaël Troncy, and Aline Senart. HDL - Towards a harmonized dataset model for open data portals. In PROFILES 2015, 2nd International Workshop on Dataset Profiling &amp; Federated Search for Linked Data, co-located with ESWC 2015, 31 May-4 June 2015, Portorož, Slovenia. CEUR-WS.org, 2015.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] Javier D. Fernández, Jürgen Umbrich, and Axel Polleres. BEAR: Benchmarking the efficiency of RDF archiving. 2015.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Valeria Fionda, Melisachew W. Chekol, and Giuseppe Pirrò. A time warp in the web of data. In 15th Int. Semantic Web Conference (ISWC) Posters and Demos, Kobe, Japan, 2016.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] Christian Fürber and Martin Hepp. Towards a vocabulary for data quality management in semantic web architectures. In Proceedings of the 1st International Workshop on Linked Web Data Management, LWDM '11, pages 1-8, New York, NY, USA, 2011. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Timothy Lebo, Satya Sahoo, and Deborah McGuinness. PROV-O: The PROV Ontology. http://www.w3.org/TR/2013/REC-prov-o-20130430/, April 2013. W3C Recommendation.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] Fadi Maali and John Erickson. Data Catalog Vocabulary (DCAT). http://www.w3.org/TR/vocab-dcat/, January 2014. W3C Recommendation.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] Sebastian Neumaier, Jürgen Umbrich, Josiane Xavier Parreira, and Axel Polleres. Multi-level semantic labelling of numerical values. In The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part I, pages 428-445, 2016.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres. Automated quality assessment of metadata across open data portals. J. Data and Information Quality, 8(1):2:1-2:29, 2016.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] Rufus Pollock, Jeni Tennison, Gregg Kellogg, and Ivan Herman. Metadata Vocabulary for Tabular Data. https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/, December 2015. W3C Recommendation.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] Alan Tygel, Sören Auer, Jeremy Debattista, Fabrizio Orlandi, and Maria Luiza Machado Campos. Towards cleaning-up open data portals: A metadata reconciliation approach. In Tenth IEEE International Conference on Semantic Computing, ICSC 2016, Laguna Hills, CA, USA, February 4-6, 2016, pages 71-78, 2016.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] Herbert Van de Sompel, Michael Nelson, and Robert Sanderson. HTTP framework for time-based access to resource states - Memento, 2013. RFC 7089.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] Petros Venetis, Alon Y. Halevy, Jayant Madhavan, Marius Pasca, Warren Shen, Fei Wu, Gengxin Miao, and Chung Wu. Recovering semantics of tables on the web. PVLDB, 4(9):528-538, 2011.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] Ziqi Zhang. Effective and efficient semantic table interpretation using TableMiner+. Semantic Web, (Preprint):1-37, 2016.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>