<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Efficient Use of DALICC in Data Processing Pipelines with Fuzzy License Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kurt Junghanns</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Martin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norman Radtke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabine Gründer-Fahrer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Applied Informatics (InfAI)</institution>
          ,
          <addr-line>Leipzig</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Integration of huge amounts of data from various sources forms the basis for many of today's (web) applications and use cases. In order to be able to reuse, transform, process, analyze, aggregate and republish such data, terms of use and license information are particularly important. DALICC provides a very comprehensive database for licenses and offers viable services for license handling that have been made available as open source. In this paper, we outline how DALICC can be used in (web) applications (represented by the project COYPU) in which heterogeneous data from various data sources with different usage conditions are processed. The paper aims to provide feedback on DALICC, to discuss necessary adjustments and to present extensions that we have made. The overarching objective is to further activate and cooperatively develop the DALICC ecosystem.</p>
      </abstract>
      <kwd-group>
        <kwd>License clearance</kwd>
        <kwd>DALICC</kwd>
        <kwd>Dataset license processing</kwd>
        <kwd>RDF Knowledge Graphs</kwd>
        <kwd>Constraints</kwd>
        <kwd>Feedback</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>(commercial) usage scenarios will be supported which aim at the evaluation and improvement
of the resilience of value creation networks (companies, markets and regions) and decisions
for action. The data includes information about global crises, sanctions, environmental disasters, scarcity of
resources and geopolitical conditions as well as information about markets, companies, products and logistics.
It comes from various sources (e.g., ACLED<sup>2</sup> and Weltrisikoindex<sup>3</sup>), in different
formats, and is very heterogeneously linked to metadata and license information<sup>4</sup>. The following
types of data can be distinguished on the basis of their terms of use and the processing
they allow.</p>
      <list list-type="bullet">
        <list-item><p>Data processable: the license is interlinked and the terms of use can be dereferenced (the results are machine-readable).</p></list-item>
        <list-item><p>Data processable with additional effort: terms of use are given, but have to be manually converted into a machine-readable form (the license is not given or not dereferenceable).</p></list-item>
        <list-item><p>Data not processable: neither a license nor terms of use are given.</p></list-item>
      </list>
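      <p>The three categories above can be sketched as a small classification routine. This is a minimal illustration, not part of DALICC or COYPU: the class <monospace>DataSource</monospace>, its fields and the function <monospace>classify</monospace> are hypothetical names, and the sketch assumes that a source with a non-dereferenceable license link falls into the "additional effort" category.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Processability(Enum):
    PROCESSABLE = "processable"
    ADDITIONAL_EFFORT = "processable with additional effort"
    NOT_PROCESSABLE = "not processable"


@dataclass
class DataSource:
    """Hypothetical description of a data source and its license metadata."""
    name: str
    license_uri: Optional[str] = None        # link to a license, if any
    license_dereferenceable: bool = False    # does the link yield machine-readable terms?
    terms_of_use_text: Optional[str] = None  # free-text terms of use, if any


def classify(source: DataSource) -> Processability:
    """Assign a data source to one of the three categories."""
    if source.license_uri and source.license_dereferenceable:
        return Processability.PROCESSABLE
    if source.license_uri or source.terms_of_use_text:
        # Assumption: some license information exists but needs manual conversion.
        return Processability.ADDITIONAL_EFFORT
    return Processability.NOT_PROCESSABLE
```
]]></preformat>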
      <p>
        Due to the massive amount of data, it is particularly important to acquire the terms of use
(permissions, prohibitions, obligations) as efficiently as possible. A further challenge is the
aggregation of datasets for which the terms of use are not subsumed by license URIs. The Data
Licenses Clearance Center (DALICC) [<xref ref-type="bibr" rid="ref1">1</xref>] has a very comprehensive database for licenses and
offers viable services for license handling that have been made available as open source. In this
paper, we outline how DALICC can be used in large-scale (web) applications (represented by
COYPU) in which heterogeneous data from various data sources with different usage conditions
are processed. The aim of the paper is to provide feedback on DALICC, to discuss necessary
adjustments and to announce extensions that we have made.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <p>DALICC<sup>5</sup> offers, among other things, services for the efficient handling of licenses to providers
of digital assets (commercial and open source) and interested parties. The publicly available
software framework consists of three main components: license library, license search and
license composer. DALICC uses RDF and SPARQL to provide processable license details,
facet-based search, conflict detection and license resolution, and is published via GitHub<sup>6</sup>. The DALICC
services are highly valuable and are thus used in COYPU for (a) dereferencing the terms of use based
on existing license URIs (forward search), (b) determining license compatibility and (c) determining
possible licenses based on given permissions, prohibitions and obligations (reverse search).</p>
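      <p>The three kinds of use, (a) forward search, (b) compatibility checking and (c) reverse search, can be illustrated on a toy in-memory license library. The sketch below does not call the DALICC services and does not reproduce their logic: the license IRIs, the action names and all function names are hypothetical, and the conflict check is a deliberately naive stand-in (one license permits what the other prohibits).</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class LicenseTerms:
    permissions: FrozenSet[str]
    prohibitions: FrozenSet[str]
    duties: FrozenSet[str]


# Hypothetical license library; IRIs and terms are illustrative only.
LIBRARY: Dict[str, LicenseTerms] = {
    "https://example.org/license/open-attribution": LicenseTerms(
        permissions=frozenset({"reproduce", "distribute", "derive", "commercialize"}),
        prohibitions=frozenset(),
        duties=frozenset({"attribute"}),
    ),
    "https://example.org/license/non-commercial": LicenseTerms(
        permissions=frozenset({"reproduce", "distribute"}),
        prohibitions=frozenset({"commercialize"}),
        duties=frozenset({"attribute"}),
    ),
}


def forward_search(license_iri: str) -> LicenseTerms:
    """(a) Look up the terms of use for a known license IRI."""
    return LIBRARY[license_iri]


def in_conflict(a: LicenseTerms, b: LicenseTerms) -> bool:
    """(b) Naive conflict check: one license permits what the other prohibits."""
    return bool(a.permissions & b.prohibitions) or bool(b.permissions & a.prohibitions)


def reverse_search(
    permissions: Iterable[str],
    prohibitions: Iterable[str],
    duties: Iterable[str],
) -> List[str]:
    """(c) Return the IRIs of all licenses whose terms include the requested ones."""
    wanted_perm = frozenset(permissions)
    wanted_proh = frozenset(prohibitions)
    wanted_duty = frozenset(duties)
    return sorted(
        iri
        for iri, terms in LIBRARY.items()
        if wanted_perm <= terms.permissions
        and wanted_proh <= terms.prohibitions
        and wanted_duty <= terms.duties
    )
```
]]></preformat>
      <p>In this sketch, a reverse search for the permission <monospace>derive</monospace> returns only the first license, while a search for <monospace>reproduce</monospace> with the duty <monospace>attribute</monospace> returns both.</p>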
      <p>The Data Catalog Vocabulary (DCAT)<sup>7</sup> is the standard RDF vocabulary for describing data
catalogs, data services and datasets together with their metadata, especially their sources. In this paper, DCAT
is used to describe data sources and to link them to their licenses. The Open Digital Rights Language<sup>8</sup>
(ODRL) is a policy expression language that includes a flexible and interoperable information model,
vocabulary and encoding mechanisms, and provides a basis for representing statements about
the usage of content and services. Its concepts are used in DALICC and in our use case alike.
<fn id="fn2"><label>2</label><p>https://acleddata.com/</p></fn>
<fn id="fn3"><label>3</label><p>https://weltrisikobericht.de</p></fn>
<fn id="fn4"><label>4</label><p>A full list of public data sources can be found at https://datasets.coypu.org/</p></fn>
<fn id="fn5"><label>5</label><p>https://www.dalicc.net/</p></fn>
<fn id="fn6"><label>6</label><p>https://github.com/dalicc/dalicc</p></fn>
<fn id="fn7"><label>7</label><p>https://www.w3.org/TR/vocab-dcat-2/</p></fn>
<fn id="fn8"><label>8</label><p>https://www.w3.org/TR/odrl-vocab/</p></fn></p>
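      <p>As a minimal sketch of how a data source might be described with DCAT and linked to a license, the following helper emits a small Turtle document. The properties <monospace>dct:title</monospace> and <monospace>dct:license</monospace> and the class <monospace>dcat:Dataset</monospace> come from the DCAT and Dublin Core vocabularies; the function name and all IRIs in the usage example are hypothetical and do not reflect the actual COYPU data model.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def dataset_turtle(dataset_iri: str, title: str, license_iri: str) -> str:
    """Serialize a dataset description with a license link as Turtle."""
    return "\n".join([
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "@prefix dct: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_iri}> a dcat:Dataset ;",
        f'    dct:title "{escape_literal(title)}" ;',
        f"    dct:license <{license_iri}> .",
    ])


if __name__ == "__main__":
    # Hypothetical IRIs, for illustration only.
    print(dataset_turtle(
        "https://example.org/dataset/demo",
        "Demo dataset",
        "https://example.org/license/demo",
    ))
```
]]></preformat>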
      <p>The COYPU approach has been created as an extension of the common DALICC use case.
Figure 1 sketches the two use cases. In case 1, there is a license graph with a fixed number of licenses,
available as a SPARQL endpoint and used by different API routes to support applications using
processable datasets which have their licenses linked. This already allows for retrieving details
of licenses, facet-based search and compatibility checking.</p>
      <p>In case 2, the terms of use of various data sources do not match any of the licenses currently
available in DALICC, or do not even fit any existing license at all. For instance, in COYPU,
this affects 13 out of 40 data sources in our test sample. Hence, users need more support with
license information and, if necessary, with adding it. Moreover, as the use case includes commercial
as well as scientific application scenarios, an applicable solution should represent the relevant
distinctions prominently and accessibly in the respective license information and in the
DALICC API routes. Last but not least, whenever datasets from different sources are to be
aggregated, combined, published and used together, users are dependent on the tools used and
their terms of use. In a large-scale application scenario such as COYPU, this functionality should
be considered the most wanted as well as the most complex requirement, as it builds and depends on
the availability and quality of the processes for the simple cases and on a non-trivial combinatorial logic.</p>
      <p>Using DALICC as our basis, we have so far implemented<sup>9</sup> the following extensions to tackle the
issues just mentioned. To the list of licenses available in DALICC, we added RDF representations
of Datenlizenz Deutschland – Namensnennung – Version 2.0 (DLDEBY20) as well as a license
placeholder including prohibitions, duties and permissions for the Weltrisikoindex. At the same
time, we modified the existing API to deliver Turtle instead of JSON as its output format and
supplemented it with a flag reflecting whether commercial use of a dataset is permitted.
Taking into account the sovereignty of DALICC with respect to its license resources and
the COYPU tooling, which resolves IRIs only with respect to internal graphs,
we create internal IRIs for all licenses, enrich them with further internal information according to
our project needs and represent them in a COYPU license KG.
<fn id="fn9"><label>9</label><p>GitLab repository: https://gitlab.com/coypu-project/dalicc</p></fn></p>
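      <p>Two of the extensions mentioned above, the commercial-use flag and the internal license IRIs, can be sketched as follows. This is an illustration under stated assumptions rather than the code in the COYPU repository: the namespace <monospace>INTERNAL_BASE</monospace>, the action name <monospace>commercialize</monospace> and both function names are hypothetical, and the flag is deliberately conservative (commercial use counts as permitted only if it is explicitly permitted and not prohibited).</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import Iterable

# Hypothetical namespace for internal license IRIs.
INTERNAL_BASE = "https://example.org/license-kg/"


def commercial_use_permitted(permissions: Iterable[str], prohibitions: Iterable[str]) -> bool:
    """Conservative flag: True only if commercial use is explicitly
    permitted and not prohibited."""
    return "commercialize" in set(permissions) and "commercialize" not in set(prohibitions)


def internal_iri(external_iri: str) -> str:
    """Derive an internal IRI from the last path segment of an external license IRI."""
    segment = external_iri.rstrip("/").rsplit("/", 1)[-1]
    slug = re.sub(r"[^A-Za-z0-9]+", "-", segment).strip("-").lower()
    return INTERNAL_BASE + slug
```
]]></preformat>
      <p>Keeping a mapping from each internal IRI back to the original license IRI would preserve the link to the DALICC resource; how this is recorded in the license KG is not specified here.</p>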
      <p>Future work will focus on adding more licenses and license details (e.g., regarding permission
for data analysis and entanglement). To tackle the problem of derived terms of use for aggregated
datasets, we are currently cooperating with legal experts to work out a basic combinatorial logic.
Furthermore, we plan to implement new API routes for the validation and upload of new licenses
as well as for tracking changes in terms of use or data licenses via semantic versioning. Most of
our contributions will be made available via pull requests on GitHub in order to give them back
to DALICC.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion and Open Questions</title>
      <p>In application scenarios with well-defined and well-known data ecosystems, DALICC can be
used for data license processing directly. It thereby closes a large gap in getting data efficiently
into applications. However, if license information and terms of use are not available or are provided
only rudimentarily, the effort increases significantly and direct application of the approach
may fail. At this point, DALICC still provides valuable support by offering its resources as open
source, thereby enabling further development, as has been outlined in this paper.</p>
      <p>As alternative applications presumably have similar requirements for the efficient handling
of license data and, in particular, for the addition of further license resources, the use of
cross-project synergies would be a worthwhile goal. At this point, urgent questions arise that need
to be discussed within the research community. For instance, are project-specific terms of use
and licenses to be published in open repositories, or are they to be committed via a central
web service with a subsequent review? How could the consistency of different additions to an
assumed common resource be ensured? As a step in this direction, we intend to evaluate to
what extent additional terms of use can be reused by applying similarity measures. It would
be interesting and helpful for us to have access to other experience reports in order to learn
which extensions were implemented for the application of DALICC in alternative project contexts.</p>
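      <p>One conceivable similarity measure for comparing terms of use is the Jaccard index over the sets of permissions, prohibitions and duties. The paper does not commit to a particular measure, so the following is only a sketch of one option; the function names, the dictionary layout and the equal weighting of the three components are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def terms_similarity(x: Dict[str, Set[str]], y: Dict[str, Set[str]]) -> float:
    """Unweighted mean of the Jaccard indices for permissions,
    prohibitions and duties (equal weighting is an assumption)."""
    keys = ("permissions", "prohibitions", "duties")
    return sum(jaccard(x.get(k, set()), y.get(k, set())) for k in keys) / len(keys)
```
]]></preformat>
      <p>A high score would suggest that an existing license entry might be reused for a new data source; any threshold for this would have to be chosen empirically and reviewed by legal experts.</p>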
      <p>With this feedback and our outlined approach to extending the currently available
functionalities, we hope to contribute to the activation and further cooperative development of the
DALICC ecosystem.</p>
      <p><bold>Acknowledgements.</bold> The authors acknowledge the financial support by the Federal
Ministry for Economic Affairs and Energy of Germany in the project COYPU (project number
01MK21007[A]).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pellegrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mireles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Steyskal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Panasiuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fensel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <article-title>Automated rights clearance using semantic web technologies: The DALICC framework</article-title>
          , in: T. Hoppe, B. Humm, A. Reibold (Eds.),
          <source>Semantic Applications, Methodology, Technology, Corporate Use</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>203</fpage>
          -
          <lpage>218</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-662-55433-3_14</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>