=Paper= {{Paper |id=None |storemode=property |title=Reflections on: DCAT-AP Representation of Czech National Open Data Catalog and its Impact |pdfUrl=https://ceur-ws.org/Vol-2576/paper05.pdf |volume=Vol-2576 |authors=Jakub Klimek |dblpUrl=https://dblp.org/rec/conf/semweb/Klimek19 }} ==Reflections on: DCAT-AP Representation of Czech National Open Data Catalog and its Impact== https://ceur-ws.org/Vol-2576/paper05.pdf
      Reflections on: DCAT-AP Representation of
      Czech National Open Data Catalog and its
                        Impact*

                                      Jakub Klímek[0000−0001−7234−3051]

      Department of Software Engineering, Faculty of Mathematics and Physics,
     Charles University, Malostranské náměstí 25, 118 00 Praha 1, Czech Republic
                               klimek@ksi.mff.cuni.cz
                             https://jakub.klímek.com



          Abstract. Open data is now a heavily discussed topic around the world
          and in the European Union. In the Czech Republic, open data is a term
          anchored in legislation, which includes the requirement of registration
          of all open data in the Czech National Open Data Catalog (NODC). In
          the journal paper [5] we describe the NODC, its architecture, dataset
          registration processes including the harvesting of Local Open Data Cat-
          alogs (LODCs), proprietary XML API and its obsolete dataset viewer.
          Next we describe the process of transformation of the NODC metadata
          to the DCAT-AP v1.1 RDF representation from the data model point of
          view and from the technical environment point of view. We describe the
          dataset quality measurements computed using the new data representa-
          tion and its further impact on the Linked Open Data (LOD) environment
          including the harvesting of the metadata by the European Data Portal
          (EDP). Finally, we evaluate the data transformation and publishing en-
          vironment from the usability, portability, availability and performance
          perspectives.

          Keywords: open data · catalog · DCAT-AP · Linked Data


1      Introduction

Open data is currently a hot topic among institutions of public administration
and data users around the world. From the political point of view, publishing
open data is important for public administration institutions to show that they
are transparent and open to citizens. From the legal point of view it is impor-
tant that the data is published using an open license, permitting users to use
the data freely. From the technical point of view, it is important that the data
is published as machine readable data in an open format, accessible on the web
with minimal effort. From the point of view of potential open data users it is im-
portant that the data can be searched for and found. Finally, from the economic
* This work was supported by the Czech Science Foundation (GAČR), grant number
  19-01641S.


Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

point of view, open data is expected to support creation of new services and new
business models^1. At the intersection of all of these points of view lie open data
portals of the data publishing institutions, which typically include open data
catalogs where datasets can be found along with the metadata describing them.
The metadata descriptions contain the necessary information about the licenses
of datasets, formats of their distributions, and the textual descriptions, all of
which can be used for dataset search. This results in a number of open data
catalogs, typically one per institution willing to publish open data. Therefore,
a problem with discoverability of the data catalogs and the individual datasets
described in them arises, and along with it a need for aggregate views over multi-
ple data catalogs. To address this need, a standard for representation of dataset
metadata, the Data Catalog Vocabulary (DCAT) [3], was developed by the W3C
to enable dataset metadata exchange among data catalogs, and specifically to
support hierarchies of catalogs. In the European Union, an application profile of
DCAT, the DCAT-AP v1.1^2, has been developed, further specifying for instance
controlled vocabularies to be used to describe datasets and their distributions.
At the same time, a top-level European data catalog, the European Data Portal
(EDP)^3, has been developed with the intent of aggregating data catalogs from
the whole EU.
    In the Czech Republic, the Ministry of the Interior (MoI) is in charge of the
open data agenda. The NODC is run by the MoI, and the public administration
institutions can either register their datasets directly in the NODC, or, prefer-
ably, they can run their own Local Open Data Catalog (LODC) which, after
registration, gets automatically harvested to the NODC. Unfortunately, until
now, the MoI had not considered DCAT-AP v1.1 as a metadata publishing format.
The NODC internal data model was inspired by the DCAT-AP v1.0; however,
it is XML-based and accessible only via a proprietary XML API. In addition,
the metadata registered in the NODC was only viewable to users via a rather
unfriendly user interface.
    As the NODC became known and used, new requirements on its functionality
arose, mainly the need to monitor dataset metadata quality, and the need to
be harvested by the EDP. To address these requirements, we first transform the
current NODC metadata to the DCAT-AP v1.1 RDF representation and publish
it. Then we use this new representation to compute the dataset metadata quality
measurements and to facilitate the harvesting of the NODC metadata by the
EDP. Similar requirements have also been identified elsewhere, e.g. in Serbia [4], as
requirements to implement the revised European Directive on the Public Sector
Information (2013/37/EU), emphasizing the role of the Linked Data approach
for improved interoperability and re-use.


^1 https://www.europeandataportal.eu/en/highlights/economic-benefits-open-data
^2 https://joinup.ec.europa.eu/release/dcat-ap-v11
^3 https://www.europeandataportal.eu

1.1   Contributions

In the journal paper [5] we describe the NODC and its architecture, data in-
put, data model, its API and the mechanism of harvesting the metadata from
LODCs to the NODC. We describe how we transform the current metadata
to the DCAT-AP v1.1 RDF representation both from the data model point of
view and from the technical environment point of view. We show the metadata
quality measures that we compute using the DCAT-AP v1.1 representation and
we evaluate the further impact this new metadata representation has, including a
new frontend for viewing the dataset metadata and the effects it has on the
Linked Open Data (LOD) environment. Finally, we evaluate the data transfor-
mation environment from the usability, portability, availability and performance
perspectives.


1.2   Outline

The rest of this extended abstract is structured according to the original paper
[5], providing an outline of the paper by summarizing the contents of each of the
sections.
     In Section 2 we describe the NODC, its architecture, the processes of har-
vesting LODCs, and its proprietary XML API. In Section 3 we describe the
data transformation to DCAT-AP v1.1 and the linking to related datasets and
code lists. In Section 4 we describe the technical environment used for the data
transformation and publication process. In Section 5 we describe data qual-
ity measures which we compute based on the transformed data. In Section 6
we show additional impact of the published dataset by presenting some of the
known usages of the published data, which include harvesting by the European
Data Portal. In Section 7 we evaluate the data transformation and publishing
environment according to various criteria. Finally, we survey related work, and in
Section 8 we conclude.


2     Czech National Open Data Catalog (NODC)

The institutions in the Czech public administration are required to register their
published data in the Czech National Open Data Catalog (NODC) before they
can call it open data. There are two ways of registering to the NODC. For
smaller institutions such as village councils there is the possibility of registering
individual datasets directly in the NODC using a form to fill in all the necessary
metadata. For larger institutions such as ministries or city councils, there is the
possibility of registering their own Local Open Data Catalog (LODC), as there is
an assumption that such institutions will have their own data portals anyway.
Once a LODC is registered, the metadata from it is automatically and regularly
harvested by the NODC, giving the institutions more flexibility in their dataset
management. The registered datasets can be viewed in a rather unfriendly web
user interface, which does not provide many features known from widespread
data catalog implementations such as CKAN^4 and DKAN^5.


3   Data modeling and transformations
Our goal is to publish the dataset metadata from the NODC according to the
DCAT-AP v1.1 specification. The proprietary NODC XML API will serve as our
data source. In this section of the original paper [5], the RDF vocabularies used
in the transformed data and the legacy code lists and data items are shown and
mapped to the European Union Metadata Registry Named Authority Lists (EU
MDR NALs), now parts of EU Vocabularies^6, and the RTIAR.
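
A code list mapping of this kind can be illustrated with a minimal sketch. It assumes that legacy periodicity values are written as ISO 8601 repeating intervals (e.g. R/P1D); the actual legacy notation in the NODC may differ. The dictionary ISO8601_TO_NAL and the function frequency_iri are hypothetical names introduced only for this example. The target IRIs follow the pattern of the EU Frequency authority table.

```python
from typing import Optional

# Base IRI of the EU Vocabularies Frequency authority table.
FREQUENCY_NAL = "http://publications.europa.eu/resource/authority/frequency/"

# Assumed notation of the legacy values (ISO 8601 repeating intervals).
# The mapping is illustrative and deliberately incomplete.
ISO8601_TO_NAL = {
    "R/P1D": "DAILY",
    "R/P1W": "WEEKLY",
    "R/P1M": "MONTHLY",
    "R/P3M": "QUARTERLY",
    "R/P1Y": "ANNUAL",
}


def frequency_iri(iso_value: str) -> Optional[str]:
    """Return the Frequency NAL IRI for a legacy value, or None if unmapped."""
    code = ISO8601_TO_NAL.get(iso_value.strip().upper())
    return FREQUENCY_NAL + code if code else None


print(frequency_iri("R/P1D"))
print(frequency_iri("irregularly"))
```

Returning None for unmapped values, rather than guessing, keeps unknown legacy items visible so that they can be reported and mapped manually.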


4   Data transformation and publishing environment
In this section of the paper [5] we describe the technical environment used to
transform the data from the original NODC proprietary XML API to a DCAT-
AP v1.1 dataset published as Linked Open Data. The environment is built on
open-source tools.
    The transformation is done using LinkedPipes ETL [7], which needs Java^7
and Node.js^8 to run, and Git^9 and Apache Maven^10 to build the source code
from the GitHub repository^11.
    LinkedPipes ETL is an open-source ETL tool for production and consump-
tion of Linked Data, which is in use by multiple organizations in the Czech
public administration. It is also used in the OpenBudgets.eu platform [10] for
publication and transformation of fiscal data. LinkedPipes ETL runs the data
transformation process as a so-called pipeline. The process is run daily, as that corresponds
to the periodicity of updates of the source NODC data. The pipeline has the
following principal steps:
 1. Get metadata from the proprietary NODC XML API
 2. Get the data box ID to publisher IRI and name mapping from the List of
    public administration authorities dataset
 3. Transform the metadata using an XSLT [2] template
 4. Map the ISO8601 frequencies to the Frequency EU MDR NAL
 5. Add the File Type EU MDR NAL items based on distribution MIME Types
 6. Get the previous DCAT-AP v1.1 dump and compare with the current version
    to generate statistics about new, changed and deleted datasets using the
    RDF Data Cube Vocabulary (DCV) [11]
 7. Compute metadata of the DCAT-AP v1.1 dataset itself
 8. Load an index to Apache Solr
 9. Load the metadata records to Apache CouchDB
10. Load the RDF TriG dump to a web server
11. Load the RDF data to the SPARQL endpoint
12. Run the pipelines computing data quality measurements (see Section 5)

^4 https://github.com/ckan/ckan
^5 https://getdkan.org/
^6 https://publications.europa.eu/en/web/eu-vocabularies
^7 https://www.oracle.com/java/index.html
^8 https://nodejs.org
^9 https://git-scm.com/
^10 https://maven.apache.org/
^11 https://github.com/linkedpipes/etl
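
The comparison of two consecutive dumps in step 6 can be sketched as follows. This is not the LinkedPipes ETL implementation: the actual pipeline works on RDF and publishes the statistics using the RDF Data Cube Vocabulary. Here, as a simplifying assumption, each dump is a dictionary keyed by dataset IRI whose value is the set of (predicate, object) pairs describing the dataset. The function diff_dumps and the example IRIs are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

# Simplified stand-in for a dataset's RDF description (an assumption of this sketch).
Record = FrozenSet[Tuple[str, str]]


def diff_dumps(previous: Dict[str, Record], current: Dict[str, Record]) -> Dict[str, int]:
    """Count new, changed and deleted datasets between two dumps keyed by IRI."""
    prev_ids, curr_ids = set(previous), set(current)
    # A dataset is "changed" if it exists in both dumps with different content.
    changed = sum(1 for iri in prev_ids & curr_ids if previous[iri] != current[iri])
    return {
        "new": len(curr_ids - prev_ids),
        "changed": changed,
        "deleted": len(prev_ids - curr_ids),
    }


# Hypothetical example data.
previous = {
    "https://example.org/dataset/a": frozenset({("dct:title", "A")}),
    "https://example.org/dataset/b": frozenset({("dct:title", "B")}),
    "https://example.org/dataset/c": frozenset({("dct:title", "C")}),
}
current = {
    "https://example.org/dataset/a": frozenset({("dct:title", "A")}),
    "https://example.org/dataset/b": frozenset({("dct:title", "B, updated")}),
    "https://example.org/dataset/d": frozenset({("dct:title", "D")}),
}

stats = diff_dumps(previous, current)
print(stats)
```

Since the pipeline runs daily, each run of such a comparison yields one observation of the three counts for that day.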


5      Metadata quality measures

In this section of the paper [5] we describe quality measures monitored using the
NODC loaded in a SPARQL endpoint. We distinguish two types of measures.
The first is based solely on what can be found in the metadata itself. The second type
uses the metadata registered in NODC to try to access the linked resources, i.e.
licenses, schemas, documentation and the distributions themselves, and check
whether they are served correctly, e.g. with a correct Media Type. To compute
both types of quality measures, pipelines in LinkedPipes ETL are used. Since
the quality measures are based on the DCAT-AP v1.1 specification, they are
directly reusable for other DCAT-AP v1.1 compliant datasets.


5.1     Metadata based quality measures

Since there is no validation of input data in the current NODC harvester of
LODCs, there are metadata records violating DCAT-AP v1.1 constraints or
constraints dictated by the Czech legislation. The first part of the quality mea-
sures in this section aims at detecting such anomalies. In addition, there are
measures aiming at providing an overview of common practice. The results of
these measures are published on the Czech Open Data Portal^12. The measures
are:

 1. Number of distributions with unspecified license per publisher
 2. Number of datasets with distributions with unspecified licenses per publisher
 3. Number of datasets missing required attributes per publisher
 4. List of datasets missing required attributes per publisher
 5. Number of distributions with a given MIME type per publisher
 6. Distribution licenses per publisher
 7. Number of publishers per license
 8. Number of datasets with a given accrual periodicity per publisher
 9. Number of datasets and distributions per publisher

    These measures were selected based on the most frequently appearing errors
in the metadata records. These erroneous records cause a decline in the overall
metadata quality in the NODC. When users encounter them, they tend to
blame the NODC for the inconsistent-looking record; therefore, it is important
to us that the publishers correct their records. This is also why it is important
to consistently point out errors in the records and demand their correction.
From our experience, the most effective way to achieve the correction in the
(Czech) public administration is to publicly display the errors attributed to
the originating publishers, along with clear instructions on how to correct the
mistakes. The errors in the records also usually reveal a deeper problem with
data management at the original publisher.

^12 https://opendata.gov.cz/statistika:datova-kvalita (in Czech only)

5.2   Web access based quality measures
The measures listed in this section check whether the resources linked from the
metadata records actually exist, and whether they are served correctly. There
are four types of resources linked from the metadata, namely distributions, licenses,
dataset documentation and distribution schemas. For each of the four resource
types we compute a summary statistic and a list of offending resources.
    Here, we list the quality measures; the original paper [5] additionally describes
the individual columns of each result.
 1. Statistics of unavailable dataset distributions per publisher
 2. List of unavailable distributions
 3. Statistics of unavailable schemas of dataset distributions per publisher
 4. List of unavailable distribution schemas
 5. Statistics of unavailable licenses of dataset distributions per publisher
 6. List of unavailable distribution licenses
 7. Statistics of unavailable documentation of datasets per publisher
 8. List of unavailable dataset documentation
    In addition to the availability measures described above, there is one more
measure dealing with inconsistency between the distribution Media Type regis-
tered in the NODC and the Media Type returned by the web server serving the
distribution.

6     Additional impact of publishing NODC as Linked Open
      Data using DCAT-AP v1.1
In this section of [5], we demonstrate the impact of publishing the NODC
contents as Linked Open Data according to the DCAT-AP v1.1 specification,
beyond the ability to compute the quality measures described in Section 5, by
describing the effects it has on the LOD environment.

6.1   Promotion of Linked Open Data and usage of standardized
      vocabularies
The theoretical advantages of LOD described in existing literature are not enough
to convince the representatives of public administration institutions to publish
their data as LOD. We are using the example of the NODC and others,
such as the Czech Social Security Administration [6], and the infrastructure
used to process the data and publish it as DCAT-AP v1.1 to convince other
institutions that publishing LOD is possible without excessive resources.

6.2     Harvesting by the European Data Portal
A clear added value of the DCAT-AP v1.1 representation of the NODC data is
the ability to be harvested by the European Data Portal (EDP), a well-known
pan-European open data catalog, in a native LOD way. In fact, this was
beneficial not only to the Czech publishers, as their metadata got published
on the European level, but also to the developers of the European Data Por-
tal. This is because the Czech NODC was the first European data portal to
publish the metadata using DCAT-AP v1.1 in RDF natively, i.e. as an RDF
data dump, dereferenceable IRIs and a SPARQL endpoint. At the same time, the
Czech NODC is the largest catalog in the EDP.

6.3     Ability to use LinkedPipes DCAT-AP Viewer as frontend
The original NODC viewer was quite unfriendly to the users. However, the pro-
prietary nature of the published metadata made it hard to convince someone
to develop an alternative frontend. Thanks to the transformation of the data to
DCAT-AP v1.1, we are now able to use the LinkedPipes DCAT-AP Viewer [8],
which is friendlier (based on the System Usability Scale (SUS)^13) and offers more
functionality than the original, including a multilingual user interface exploiting
the multilingual EU MDR NALs where possible, full text search, a keyword word
cloud, handling of large numbers of distributions, etc.

6.4     Proof of need for a Linked Data Consumption Platform
In our research group we focus not only on LOD publishing, but also on LOD
consumption, where we identified a distinct lack of tools for actually using LOD
that is published [9]. This lack of tools is proven every time we publish a new
dataset, as users are saying that they do not know how to consume LOD and
that they require data in formats they are used to, i.e. CSV, JSON and XML
files. The case with NODC was no different. After publishing the data in RDF
and in the SPARQL endpoint, some users demanded CSV exports of the data,
even though they could be obtained using a simple SPARQL SELECT query.
The users need a tool we call a Linked Data Consumption Platform (LDCP),
which would help them in consuming LOD, and ideally, it would be easier to use
than tools for consumption of CSV, JSON and XML files thanks to the benefits
LOD brings, and it might not even require any specific LOD-related knowledge.
The users who demand those non-LOD representations of data thus serve as
motivation for our LDCP-related efforts.

^13 https://www.usability.gov/how-to-and-tools/methods/system-usability-scale.html

7      Environment evaluation
In this section of [5], we evaluate the LOD publishing environment described in
Section 4 used by the Ministry of the Interior of the Czech Republic (MoI) to
prepare and publish the DCAT-AP v1.1 NODC RDF dataset. We evaluate the
environment using so-called quality attributes as introduced in [1]. We evaluate
the following quality attributes:

    – usability
    – portability
    – availability
    – performance

   The environment integrates various open-source tools. We do not evaluate
each individual tool but the environment as a whole.


8      Conclusions

In the paper [5] we describe the current architecture of the Czech National Open
Data Catalog (NODC), from manual data entry, through harvesting of Local Open
Data Catalogs (LODCs) and data storage, to data publication using its proprietary
XML API and a rather unfriendly viewer. Next, we describe the vocabularies
and code lists used in the transformation of the NODC data to its DCAT-AP v1.1
RDF representation. We describe the technical environment used for the data
transformation and the data quality measures that can be computed using the RDF
data representation, both using LinkedPipes ETL. We describe the additional
impact of publishing the NODC using DCAT-AP v1.1 and evaluate the data
transformation environment both from the perspective of the transformation designers
and from the perspective of the data users.
    While the paper was being written, the MoI changed their supplier of the
implementation of the original NODC, making it permanently unavailable, as
the new supplier was not able to take over the original implementation. This
nevertheless showed another benefit of publishing open data, as our copy of
the NODC used to demonstrate the transformation to DCAT-AP v1.1 and the
browsing of the data in LinkedPipes DCAT-AP Viewer remained the only ex-
isting publicly available NODC instance. Currently, it is already running as the
official NODC instance at https://data.gov.cz.
    Finally, we conclude the paper with a summary of the lessons learned
during our work with the MoI and publishers of open data in the Czech Republic.


References

 1. Bass, L., Clements, P., Kazman, R.: Software Architecture in Practice. Addison-
    Wesley Professional, 3rd edn. (2012)
 2. Clark, J.: XSL transformations (XSLT) version 3.0. W3C Recommendation, W3C
    (Jun 2017), https://www.w3.org/TR/1999/REC-xslt-19991116
 3. Erickson, J., Maali, F.: Data Catalog Vocabulary (DCAT). W3C Recommendation,
    W3C (Jan 2014), https://www.w3.org/TR/2014/REC-vocab-dcat-20140116/
 4. Janev, V., Mijovic, V., Vranes, S.: Proposal for Implementing the EU PSI Direc-
    tive in Serbia. In: Ko, A., Francesconi, E. (eds.) Electronic Government and the
    Information Systems Perspective - 5th International Conference, EGOVIS 2016,
    Porto, Portugal, September 5-8, 2016, Proceedings. Lecture Notes in Computer
    Science, vol. 9831, pp. 16–30. Springer (2016).
    https://doi.org/10.1007/978-3-319-44159-7_2
 5. Klímek, J.: DCAT-AP representation of Czech National Open Data Cat-
    alog and its impact. Journal of Web Semantics 55, 69 – 85 (2019).
    https://doi.org/10.1016/j.websem.2018.11.001, http://www.sciencedirect.com/
    science/article/pii/S1570826818300532
 6. Klímek, J., Kučera, J., Nečaský, M., Chlapek, D.: Publication and usage of of-
    ficial Czech pension statistics Linked Open Data. Journal of Web Semantics
    48, 1 – 21 (2018). https://doi.org/10.1016/j.websem.2017.09.002, http://www.
    sciencedirect.com/science/article/pii/S1570826817300343
 7. Klímek, J., Škoda, P.: LinkedPipes ETL in use: practical publication and
    consumption of linked data. In: Indrawan-Santiago, M., Steinbauer, M., Sal-
    vadori, I.L., Khalil, I., Anderst-Kotsis, G. (eds.) Proceedings of the 19th In-
    ternational Conference on Information Integration and Web-based Applications
    & Services, iiWAS 2017, Salzburg, Austria, December 4-6, 2017. pp. 441–445.
    ACM (2017). https://doi.org/10.1145/3151759.3151809, https://doi.acm.org/
    10.1145/3151759.3151809
 8. Klímek, J., Škoda, P.: LinkedPipes DCAT-AP Viewer: A Native DCAT-AP Data
    Catalog. In: van Erp, M., Atre, M., López, V., Srinivas, K., Fortuna, C. (eds.) Pro-
    ceedings of the ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas
    Tracks co-located with 17th International Semantic Web Conference (ISWC 2018),
    Monterey, USA, October 8th - to - 12th, 2018. CEUR Workshop Proceedings,
    vol. 2180. CEUR-WS.org (2018), http://ceur-ws.org/Vol-2180/paper-32.pdf
 9. Klímek, J., Škoda, P., Nečaský, M.: Requirements on Linked Data Consumption
    Platform. In: Auer, S., Berners-Lee, T., Bizer, C., Heath, T. (eds.) Proceedings
    of the Workshop on Linked Data on the Web, LDOW 2016, co-located with
    25th International World Wide Web Conference (WWW 2016). CEUR Workshop
    Proceedings, vol. 1593. CEUR-WS.org (2016), http://ceur-ws.org/Vol-1593/
    article-01.pdf
10. Musyaffa, F.A., Halilaj, L., Li, Y., Orlandi, F., Jabeen, H., Auer, S., Vidal,
    M.: OpenBudgets.eu: A Platform for Semantically Representing and Analyzing
    Open Fiscal Data. In: Mikkonen, T., Klamma, R., Hernández, J. (eds.) Web
    Engineering - 18th International Conference, ICWE 2018, Cáceres, Spain, June
    5-8, 2018, Proceedings. Lecture Notes in Computer Science, vol. 10845, pp.
    433–447. Springer (2018). https://doi.org/10.1007/978-3-319-91662-0_35
11. Reynolds, D., Cyganiak, R.: The RDF Data Cube Vocabulary. W3C
    Recommendation,        W3C      (Jan     2014),    https://www.w3.org/TR/2014/
    REC-vocab-data-cube-20140116/