=Paper= {{Paper |id=Vol-1376/paper07 |storemode=property |title=What's up LOD Cloud? Observing The State of Linked Open Data Cloud Metadata |pdfUrl=https://ceur-ws.org/Vol-1376/LDQ2015_paper_07.pdf |volume=Vol-1376 |dblpUrl=https://dblp.org/rec/conf/esws/AssafST15 }} ==What's up LOD Cloud? Observing The State of Linked Open Data Cloud Metadata== https://ceur-ws.org/Vol-1376/LDQ2015_paper_07.pdf
                          What’s up LOD Cloud?
                          Observing The State of
                    Linked Open Data Cloud Metadata

                 Ahmad Assaf1,2, Raphaël Troncy1 and Aline Senart2
      1
          EURECOM, Sophia Antipolis, France
      2
          SAP Labs France



          Abstract. Linked Open Data (LOD) has emerged as one of the largest
          collections of interlinked datasets on the Web. To benefit from this
          mine of data, one needs access to descriptive information about each
          dataset (its metadata). However, the heterogeneous nature of data
          sources directly affects data quality, as these sources often contain
          inconsistent, misinterpreted and incomplete metadata. Considering the
          significant variation in size, language and freshness of the data,
          finding useful datasets without prior knowledge is increasingly
          complicated. We have developed Roomba, a tool that validates, corrects
          and generates dataset metadata. In this paper, we present the results
          of running this tool on the parts of the LOD cloud accessible via the
          datahub.io API. The results demonstrate that the general state of the
          datasets needs more attention: most of them suffer from poor-quality
          metadata and lack informative metrics that are needed to facilitate
          dataset search. We also show that the automatic corrections performed
          by Roomba increase the overall quality of the dataset metadata, and we
          highlight the need for manual effort to correct some important missing
          information.

          Keywords: Dataset Profile, Metadata, Data Quality, Data Portal


1        Introduction
The Linked Open Data (LOD) cloud3 has grown significantly in recent years,
offering various datasets covering a broad set of domains, from life sciences to
media and government data [4]. To maintain high quality data, publishers should
comply with a set of best practices detailed in [3]. Metadata provisioning is one
of those best practices: publishers should attach the metadata needed to
effectively understand and use their datasets.
    Data portals expose metadata via various models. A model should contain
the minimum amount of information that conveys to the inquirer the nature
and content of its resources [7]. It should contain information to enable data
discovery, exploration and exploitation. We divided the metadata information
into the following types:
3
    The datahub.io view of the LOD cloud is at http://datahub.io/dataset?tags=lod
2         Ahmad Assaf, Raphaël Troncy and Aline Senart

    – General information: General information about the dataset (e.g. title,
      description, ID). This general information is manually filled in by the
      dataset owner. In addition, tags and group information are required for
      classification and for enhancing dataset discoverability.
    – Access information: Information about accessing and using the dataset.
      This includes the dataset URL, license information (i.e. license title
      and URL) and information about the dataset’s resources. Each resource
      generally has a set of attached metadata (e.g. resource name, URL,
      format, size).
    – Ownership information: Information about the ownership of the dataset
      (e.g. organization details, maintainer details, author). This information
      is important to identify the authority to which the generated report and
      the newly corrected profile will be sent.
    – Provenance information: Temporal and historical information about the
      dataset and its resources (e.g. creation and update dates, version
      information). Most of this information can be automatically filled in
      and tracked.
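The four information types above can be sketched as a simple lookup table. The field-to-type grouping below is illustrative: the field names follow CKAN conventions, but the exact assignment is our assumption, not a specification from the paper.

```python
# Illustrative grouping of CKAN metadata fields by information type.
METADATA_TYPES = {
    "general": ["title", "notes", "id", "tags", "groups"],
    "access": ["url", "license_title", "license_url", "resources"],
    "ownership": ["author", "author_email", "maintainer", "maintainer_email"],
    "provenance": ["metadata_created", "metadata_modified", "version"],
}

def classify_field(field):
    """Return the information type a metadata field belongs to, or None."""
    for info_type, fields in METADATA_TYPES.items():
        if field in fields:
            return info_type
    return None
```

Such a table lets a validator report completeness per information type rather than as one flat list of fields.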

    Data portals are datasets’ access points, providing tools to facilitate data
publishing, sharing, searching and visualization. CKAN4 is the world’s leading
open-source data portal platform, powering websites like the Datahub, which
hosts the LOD cloud metadata.
    We have created Roomba [1], a tool that automatically validates, corrects
and generates dataset metadata for CKAN portals. The datasets are validated
against the CKAN standard metadata model5 . The model describes four main
sections in addition to the core dataset’s properties. These sections are:

    – Resources: The actual accessible raw data. Resources can come in various
      formats (JSON, XML, RDF, etc.) and can be downloaded or accessed directly
      (REST API, SPARQL endpoint).
    – Tags: Provide descriptive knowledge about the dataset’s content and
      structure.
    – Groups: Used to cluster or curate datasets based on shared themes or
      semantics.
    – Organizations: Describe datasets solely based on their association with
      a specific administrative party.

The results demonstrate that the general state of the examined datasets needs
much more attention: most of the datasets suffer from poor-quality metadata and
lack informative metrics that would facilitate dataset search. The noisiest
metadata values were access information, such as licensing information and
resource descriptions, in addition to a large number of resource reachability
problems. We also show that the tool’s automatic corrections increase the
overall quality of the dataset metadata, and we highlight the need for manual
effort to correct some important missing information.
4
    http://ckan.org
5
    http://demo.ckan.org/api/3/action/package_show?id=adur_district_
    spending
2     Related Work

The Data Catalog Vocabulary (DCAT) [8] and the Vocabulary of Interlinked
Datasets (VoID) [6] are models for representing RDF datasets metadata. There
exist several tools aiming at exposing dataset metadata using these vocabularies
such as [5]. Few approaches tackle the issue of examining dataset metadata.
The Project Open Data Dashboard6 validator analyzes machine-readable files
for automated metrics to check their alignment with the Open Data principles.
Similarly, for the LOD cloud, the Datahub LOD Validator7 checks a dataset’s
compliance for inclusion in the LOD cloud. However, it lacks the ability to give
detailed insights into the completeness of the metadata or an overview of the
state of the entire LOD cloud group.
    The State of the LOD Cloud report [2] measured the adoption of Linked
Data best practices back in 2011. More recently, the authors of [10] used
LDSpider [9] to crawl and analyze 1014 different datasets in the Web of Linked
Data in 2014. While these reports expose important information about datasets,
such as provenance, licensing and accessibility, they do not cover the entire
spectrum of metadata categories as presented in [11].


3     Experiments and Evaluation

In this section, we describe our experiments when running the Roomba tool on
the LOD cloud. All the experiments are reproducible by our tool and their results
are available on its Github repository at https://github.com/ahmadassaf/
opendata-checker.


3.1    Experimental Setup

The current state of the LOD cloud report [10] indicates that 1014 datasets are
available. These datasets have been harvested by the LDSpider crawler [9],
seeded with 560 thousand URIs. However, since Roomba requires the dataset
metadata to be hosted in a data portal, where either the dataset publisher or
the portal administrator can attach relevant metadata to it, we rely on the
information provided by the Datahub CKAN API. We consider two possible groups:
the first one, tagged with “lodcloud”, returns 259 datasets, while the second
one, tagged with “lod”, returns only 75 datasets. We manually inspected these
two lists and found that the API result for the tag “lodcloud” is the correct
one. The 259 datasets contain a total of 1068 resources. We ran the instance
and resource extractor from Roomba in order to cache the metadata files for
these datasets locally, and launched the validation process, which takes around
two and a half hours on a machine with a 2.6 GHz Intel Core i7 processor and
16 GB of DDR3 memory.
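Retrieving such a tag-filtered dataset list can be sketched against the CKAN Action API, which exposes a package_search endpoint accepting a tag filter. The helper below only builds the query URL; the exact calls Roomba issues are not shown in the paper.

```python
from urllib.parse import urlencode

def tag_search_url(portal, tag, rows=1000):
    """Build a CKAN Action API package_search URL filtering datasets
    by tag, e.g. tags:lodcloud on the Datahub portal."""
    query = urlencode({"fq": f"tags:{tag}", "rows": rows})
    return f"{portal}/api/3/action/package_search?{query}"

# URL for the group of datasets tagged "lodcloud" on the Datahub.
url = tag_search_url("https://datahub.io", "lodcloud")
```

Fetching this URL returns a JSON envelope whose `result.results` list holds the dataset metadata records that Roomba caches locally.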
6
    http://labs.data.gov/dashboard/
7
    http://validator.lod-cloud.net/
3.2   Results and Evaluation



CKAN dataset metadata includes three main sections in addition to the dataset’s
core properties: groups, tags and resources. Each section contains a set of
metadata fields corresponding to one or more metadata types. For example, a
dataset resource will have general information such as the resource name,
access information such as the resource URL, and provenance information such
as the creation date. The framework generates a report aggregating all the
problems in all these sections, fixing field values when possible. Errors can
be the result of missing metadata fields, undefined field values or field value
errors (e.g. an unreachable URL or a syntactically incorrect email address).
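A minimal sketch of the first two error classes (missing versus undefined fields) could look as follows; field value errors such as unreachable URLs require a live check and are omitted here. The function name and the missing/undefined distinction mirror the report categories described above, but the code itself is our illustration.

```python
def field_error(metadata, field):
    """Classify a metadata field as 'missing' (key absent),
    'undefined' (present but empty/None), or None if a value exists."""
    if field not in metadata:
        return "missing"
    value = metadata[field]
    if value is None or (isinstance(value, str) and not value.strip()):
        return "undefined"
    return None
```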
    Figures 1 and 2 show the percentage of errors found in metadata fields by
section and by information type, respectively. We observe that the most
erroneous part of the dataset core information is ownership-related, since this
information is missing or undefined for 41% of the datasets. Dataset resources
have the poorest metadata: 64% of the general metadata, all of the access
information and 80% of the provenance information contain missing or undefined
values. Table 1 shows the top metadata field errors for each metadata
information type.



     Type        Metadata Field          Error %  Section   Error Type   Auto Fix
     General     group                   100%     Dataset   Missing         -
                 vocabulary_id           100%     Tag       Undefined       -
                 url-type                96.82%   Resource  Missing         -
                 mimetype_inner          95.88%   Resource  Undefined      Yes
                 hash                    95.51%   Resource  Undefined      Yes
                 size                    81.55%   Resource  Undefined      Yes
     Access      cache_url               96.9%    Resource  Undefined       -
                 webstore_url            91.29%   Resource  Undefined       -
                 license_url             54.44%   Dataset   Missing        Yes
                 url                     30.89%   Resource  Unreachable     -
                 license_title           16.6%    Dataset   Undefined      Yes
     Provenance  cache_last_updated      96.91%   Resource  Undefined      Yes
                 webstore_last_updated   95.88%   Resource  Undefined      Yes
                 created                 86.8%    Resource  Missing        Yes
                 last_modified           79.87%   Resource  Undefined      Yes
                 version                 60.23%   Dataset   Undefined       -
     Ownership   maintainer_email        55.21%   Dataset   Undefined       -
                 maintainer              51.35%   Dataset   Undefined       -
                 author_email            15.06%   Dataset   Undefined       -
                 organization_image_url  10.81%   Dataset   Undefined       -
                 author                  2.32%    Dataset   Undefined       -
            Table 1: Top metadata field error % by information type
    We notice that 42.85% of the top metadata problems shown in Table 1 can
be fixed automatically. Among them, 44.44% can be fixed by our tool, while the
others can be fixed by tools that should be plugged into the data portal. We
further present and discuss the results grouped by metadata information type
in the following subsections.

3.3   General information
34 datasets (13.13%) do not have valid notes values. tags information for the
datasets are complete except for the vocabulary_id as this is missing from
all the datasets’ metadata. All the datasets groups information are missing
display_name, description, title, image_display_url, id, name. After
manual examination, we observe a clear overlap between group and organization
information. Many datasets like event-media use the organization field to show
group related information (being in the LOD Cloud) instead of the publishers
details.

3.4   Access information
25% of the datasets access information (being the dataset URL and any URL
defined in its groups) have issues: generally missing or unreachable URLs. 3
datasets (1.15%) do not have a URL defined (tip, uniprotdatabases, uniprot-
citations) while 45 datasets (17.3%) defined URLs are not accessible at the time
of writing this paper. One dataset does not have resources information (bio2rdf-
chebi) while the other datasets have a total of 1068 defined resources.
    On the dataset resources level, we notice wrong or inconsistent values in
the size and mimetype fields. 44 datasets have valid size field values and 54
have valid mimetype field values, but the corresponding resources were not
reachable, thus providing incorrect information. 15 fields (68%) of all the
other access metadata are missing or have undefined values. Looking closely,
we notice that most of these problems can be easily fixed automatically by
tools plugged into the data portal. For example, the top six missing fields are
cache_last_updated, cache_url, url-type, webstore_last_updated,
mimetype_inner and hash, which can all be computed and filled automatically.
However, the most important missing information, which requires manual entry,
consists of the resources’ name and description, missing from 817 (76.49%) and
98 (9.17%) resources respectively. A total of 334 resource URLs (31.27%) were
not reachable, highly affecting the availability of these datasets. CKAN
resources can be of various predefined types (file, file.upload, api,
visualization, code, documentation). Roomba also breaks down the unreachable
resources by type: 211 (63.17%) do not have a valid resource_type, 112
(33.53%) are files, 8 (2.39%) are metadata, and one each (0.29%) are of the
example and documentation types.
    To have more details about the resources’ URL types, we created a
key:object[meta-field:values] group-level report on the LOD cloud with
resources>format:title. This aggregates the resources’ format information for
each dataset. We observe that only 161 (62.16%) of the datasets with valid URLs have
SPARQL endpoints defined using the api/sparql resource format. 92.27% provided
RDF example links and 56.3% provided direct links to downloadable RDF dumps.
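The format aggregation described above can be sketched as a simple counter over the cached dataset metadata. The dict shape mirrors CKAN package metadata (each dataset has a `resources` list of dicts with a `format` key); the helper name is ours.

```python
from collections import Counter

def format_counts(datasets):
    """Aggregate resource format values across a list of CKAN
    dataset dicts, e.g. to count api/sparql endpoints."""
    counts = Counter()
    for ds in datasets:
        for res in ds.get("resources", []):
            counts[res.get("format", "unknown")] += 1
    return counts
```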
    The noisiest part of the access metadata is the license information. A
total of 43 datasets (16.6%) do not have defined license_title and license_id
fields, while 141 (54.44%) have a missing license_url field.




          Fig. 1: Error % by section
          Fig. 2: Error % by information type




3.5   Ownership information
Ownership information is divided into direct ownership (author and maintainer)
and organization information. Four fields (66.66%) of the direct ownership infor-
mation are missing or undefined. The breakdown for the missing information is:
55.21% maintainer_email, 51.35% maintainer, 15.06% author_email, 2.32%
author. Moreover, our framework performs checks to validate existing email val-
ues. 11 (0.05%) and 6 (0.05%) of the defined author_email and maintainer_email
fields are not valid email addresses respectively. For the organization informa-
tion, two field values (16.6%) were missing or undefined. 1.16% of the
organization_description and 10.81% of the organization_image_url in-
formation with two out of these URLs are unreachable.
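A syntactic email check of this kind can be sketched with a simple regular expression. This is an approximation; the exact validation rule Roomba applies is not specified in the paper.

```python
import re

# Rough syntactic check: one "@", no whitespace, a dot in the domain part.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid_email(value):
    """Return True if the value looks like a syntactically valid email."""
    return bool(value) and EMAIL_RE.match(value) is not None
```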

3.6   Provenance information
80% of the resources provenance information are missing or undefined. However,
most of the provenance information (e.g. metadata_created, metadata_modified)
can be computed automatically by tools plugged into the data portal. The only
field requiring manual entry is the version field which was found to be missing
in 60.23% of the datasets.

3.7   Enriched Profiles
Roomba can automatically fix, when possible, the license information (title, url
and id) as well as the resources mimetype and size.
     20 resources (1.87%) have an incorrect mimetype defined, while 52
resources (4.82%) have incorrect size values. These values have been
automatically fixed based on the values defined in the HTTP response header.
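The header-based correction can be sketched as follows. Content-Type and Content-Length are standard HTTP response headers; the helper name and the resource dict shape are our assumptions.

```python
def fix_resource_from_headers(resource, headers):
    """Fill or correct a resource's mimetype and size from an HTTP
    response header dict, mirroring the correction described above."""
    fixed = dict(resource)
    if "Content-Type" in headers:
        # Drop parameters such as "; charset=utf-8" from the media type.
        fixed["mimetype"] = headers["Content-Type"].split(";")[0].strip()
    if "Content-Length" in headers:
        fixed["size"] = int(headers["Content-Length"])
    return fixed
```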
     We have noticed that most of the issues surrounding license information
are related to ambiguous entries. To resolve this, we manually created a
mapping file8 standardizing the set of possible license names and URLs using
the open source and open knowledge license information9. As a result, we
managed to normalize the license information of 123 datasets (47.49%).
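The normalization step can be sketched as a lookup into such a mapping. The entries below are a hypothetical excerpt for illustration, not the actual content of the mapping file.

```python
# Hypothetical excerpt of a license mapping table; the real file lives at
# util/licenseMappings.json in the Roomba repository.
LICENSE_MAP = {
    "cc-by": {
        "title": "Creative Commons Attribution",
        "url": "http://www.opendefinition.org/licenses/cc-by",
    },
    "odc-pddl": {
        "title": "Open Data Commons Public Domain Dedication and Licence",
        "url": "http://www.opendefinition.org/licenses/odc-pddl",
    },
}

def normalize_license(license_id):
    """Return canonical title/url for a known license id, else None."""
    return LICENSE_MAP.get((license_id or "").strip().lower())
```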
     To check the impact of the corrected fields, we seeded Roomba with the
enriched profiles. Since Roomba uses a file-based cache system, we simply
replaced all the dataset JSON files in the \cache\datahub.io\datasets folder
with those generated in \cache\datahub.io\enriched. After running Roomba again
on the enriched profiles, we observe that the error percentage for missing
size fields decreased by 32.02% and for mimetype fields by 50.93%. We also
notice that the error percentage for missing license_url fields decreased by
2.32%.


4   Conclusion and Future Work

In this paper, we presented the results of running Roomba over the LOD cloud
group hosted on the Datahub. We discovered that the general state of the
examined datasets needs attention, as most of them lack informative access
information and their resources suffer from low availability. These two
metrics are of high importance for enterprises looking to integrate and use
external linked data. We found that the most erroneous part of the dataset
core information is ownership-related, since this information is missing or
undefined for 41% of the datasets. Dataset resources have the poorest
metadata: 64% of the general metadata, all of the access information and 80%
of the provenance information contained missing or undefined values.
    We also showed that the automatic correction process can effectively
enhance the quality of some information. We believe a community effort is
needed to manually correct important missing information such as ownership
details (maintainer, author, and their email addresses). As part of our future
work, we plan to run Roomba on various data portals and perform a detailed
comparison to check the metadata health of LOD datasets against those in other
prominent data portals.


Acknowledgments

This research has been partially funded by the European Union’s 7th Framework
Programme via the project Apps4EU (GA No. 325090).
8
  https://github.com/ahmadassaf/opendata-checker/blob/master/util/
  licenseMappings.json
9
  https://github.com/okfn/licenses
References
 1. A. Assaf, A. Senart, and R. Troncy. Roomba: Automatic Validation, Correction
    and Generation of Dataset Metadata. In 24th World Wide Web Conference
    (WWW), Demos Track, Florence, Italy, 2015.
 2. A. Jentzsch, R. Cyganiak, and C. Bizer. State of the LOD Cloud. http://lod-cloud.
    net/state/.
 3. C. Bizer. Evolving the Web into a Global Data Space. In 28th British National
    Conference on Advances in Databases, 2011.
 4. C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - The Story So Far. International
    Journal on Semantic Web and Information Systems (IJSWIS), 2009.
 5. C. Böhm, J. Lorey, and F. Naumann. Creating voiD Descriptions for Web-scale
    Data. Journal of Web Semantics, 9(3):339–345, 2011.
 6. R. Cyganiak, J. Zhao, M. Hausenblas, and K. Alexander. Describing Linked
    Datasets with the VoID Vocabulary. W3C Note, 2011. http://www.w3.org/TR/
    void/.
 7. D. Nebert. Developing Spatial Data Infrastructures: The SDI Cookbook, 2004.
    http://www.gsdi.org/docs2004/Cookbook/cookbookV2.0.pdf.
 8. F. Maali and J. Erickson. Data Catalog Vocabulary (DCAT). W3C Recommendation,
    2014. http://www.w3.org/TR/vocab-dcat/.
 9. R. Isele, J. Umbrich, C. Bizer, and A. Harth. LDspider: An Open-source Crawl-
    ing Framework for the Web of Linked Data. In 9th International Semantic Web
    Conference (ISWC), Posters & Demos Track, 2010.
10. M. Schmachtenberg, C. Bizer, and H. Paulheim. Adoption of the Linked Data Best
    Practices in Different Topical Domains. In 13th International Semantic Web
    Conference (ISWC), 2014.
11. A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality
    Assessment Methodologies for Linked Open Data. Semantic Web Journal, 2012.