Towards automatic deployment of Linked Data Platforms

Noorani Bakerally
Univ Lyon, IMT Mines Saint-Étienne, CNRS, Laboratoire Hubert Curien UMR 5516, F-42023 Saint-Étienne, France
noorani.bakerally@emse.fr

1 Problem Statement

In the open data context, multiple data sources found at different locations on the Web generate massive heterogeneous sets of data. Using data from these sources can provide useful information for decision-making purposes in different domains. However, the levels of heterogeneity (e.g. syntactic, semantic, access) make the exploitation of these data sources complex. Linked data platforms complying with the Linked Data Platform (LDP) standard [13], which we refer to as LDPs, can be used to facilitate data exploitation from these sources by providing a homogeneous view of and access to one or more of them. However, the main problem is that putting an LDP in place is itself complex. It can be decomposed into sub-problems categorized in the three main phases of an LDP development life cycle: design, implementation and deployment.

The design phase involves characterizing design aspects (e.g. characterizing the content of Linked Data resources) and taking design decisions based on design choices (e.g. providing only triples where the Linked Data resource is the subject). Design decisions can be based on intuition and made without a proper justification, or they can be supported by design guides. We refer to design guides as the set of standards, best practices, principles or design patterns which can be used to guide a design decision. The problem P1 is that this phase is complex, as the design decision-making process is manual and thus time consuming. Moreover, knowledge of design guides is required to make proper design decisions. After the design phase, the implementation phase starts: the design decisions are encoded in the final product.
The problem P2 in this phase is that existing LDP solutions do not provide native support for decoupling the design from the implementation. As a result, using these solutions as they are can lead to tight coupling between the design and the implementation, hence making it complex to maintain and reuse the design.

Finally, the implementation is deployed in a software environment. During the deployment phase of data-driven systems like LDPs, both the system and the data must be deployed. In our open data context, some data come with hosting constraints, such as storage limitations or license restrictions, which arise when exploiting external constrained data sources. The problem P3 in this phase is that these constraints prevent deploying a copy of the data in the software environment, thus making it complex to directly exploit these constrained data sources.

2 Relevancy

Having stated the problems (P1, P2, P3) that we consider, let us now motivate and stress their relevance using a concrete smart city case within the open data context, which we are dealing with in the OpenSensingCity1 project. The governmental body responsible for a city may want to facilitate access to city data so that it may be used for useful ends, such as developing smart city applications which can aid citizens in their daily activities. In this case, an LDP may be used, as it can uniformize access to data and thus enable much interoperability. A city is a decentralized ecosystem in which different organizations can provide data. As such, the city LDP may itself have to access and exploit other data sources which may be owned and controlled by other organizations operating in different domains (e.g. transportation, parking). There may be organizations wishing to participate in this effort by opening their data via LDPs. However, they may be reluctant, as they may not want to invest much time in developing LDPs (cf.
P1). Therefore, support such as automated solutions conforming to standards and best practices may help to speed up the design. Besides this gain of time, it may help engineers who do not master current standards and best practices to guarantee that these standards and best practices are respected in the publication of their data.

Considering the above case, new organizations may open their data, thus requiring the design of the city LDP to be maintained to take the new data sources into account. A tight coupling between the design and the implementation (cf. P2) may make this maintenance complex and lengthen the process. Moreover, another city may also want to expose data using an LDP having a similar design. Again, due to the tight coupling issue (cf. P2), it may not be possible to directly reuse the design, thus requiring the other city to re-encode the design in the implementation. If the design is completely decoupled from the implementation, it becomes highly reusable and can be directly applied to another LDP. Applying this to the city scenario, different cities may reuse a design, exposing data in the same way. As a result, generic smart city applications may be developed for different purposes, such as finding parking facilities or transportation modalities. These applications may exploit any city LDP as long as the LDPs use a design and vocabulary known by the application.

If hosting constraints (cf. P3) are not taken into consideration, manual development of data adapters may be required to exploit constrained data sources. Considering the above case, city LDPs may have to exploit data sources from organizations which are bound by licenses, or data from sources like LinkedGeoData2 which may be difficult to host due to storage limitations. Providing support for hosting constraints may facilitate the exploitation of these constrained data sources without having to perform further manual development.
1 http://opensensingcity.emse.fr
2 http://linkedgeodata.org on 1 May 2017

3 Related Work

There are solutions for publishing data via a linked data platform. Among these solutions, some conform to the LDP standard and others do not. Non-LDP-conformant solutions are important to consider because, even if they do not comply with the LDP standard, they minimally address some of the problems that we consider. Pubby3, D2R Server [3], Virtuoso4 and Triplify [1] can automatically deploy linked data from existing RDF data sources or data sources having a virtual RDF representation. Triplify and D2R Server have been designed to expose relational data as linked data. As such, they are more focused on defining mappings between relational database schemas and RDF vocabularies. The final step of providing the RDF as linked data mostly only involves ensuring that RDF resources can be dereferenced to RDF data. In addition to providing relational data as linked data, Virtuoso also provides a linked data interface to its triple store. Pubby can provide a linked data interface both to SPARQL endpoints and to static RDF documents. Pubby and Virtuoso are the only tools for directly publishing RDF data as linked data. However, most design decisions are fixed and cannot be parameterized. For example, it is not possible to specify the content to provide for RDF resources. In addition, they rely on implementation-specific details. For example, to obtain content for RDF resources, Pubby uses SPARQL DESCRIBE queries, whose behavior is implementation specific and determined by the SPARQL query processor. In summary, with respect to the problems described in Section 1, non-LDP-conformant solutions address P1 by providing some level of automation, but the automated design decisions seem to be based on intuition. P2 is considered by using a vocabulary to declaratively describe the design separately from the implementation.
However, the vocabulary has a low expressivity, resulting in design aspects which cannot be characterized, such as specifying the content definition of a linked data resource. P3 is partly addressed, as the solutions can be used to deploy a linked data platform on some constrained data sources. However, some issues are left aside, such as the treatment of blank nodes, external resources and entailment regimes.

Linked Data Platform 1.0 [13] is a W3C recommendation that provides a set of rules for read-write linked data via HTTP. From now on, we refer to the LDP standard as this W3C recommendation and to an LDP as a data platform conforming to this standard. Data resources exposed via LDPs are referred to as LDP Resources (LDPRs). LDPRs can be organized in collections, referred to as LDP containers. LDP containers are specializations of LDPRs which help organize other resources (e.g. LDP container members). Several LDP solutions exist. We restrict our analysis to those mentioned in the LDP implementation conformance report5, which shows their degree of conformance to the LDP standard. We categorize solutions implementing the LDP standard into LDPR management systems (Callimachus6, Carbon LDP7, Fedora Commons8, Apache Marmotta9, Virtuoso10, gold11, rww-play12) and LDP frameworks (Eclipse Lyo13, LDP4j [7, 8]). An LDPR management system can be seen as a repository for LDPRs on top of which CRUD operations, adhering to the LDP standard, are allowed through HTTP methods. LDP frameworks are solutions which can be used to build custom applications which implement LDP interactions. LDP-conformant solutions are in their early stages, as there is no support for deploying even existing RDF data via an LDP.

3 http://wifo5-03.informatik.uni-mannheim.de/pubby/ on 18 May 2017
4 https://virtuoso.openlinksw.com on 19 July 2017
5 https://www.w3.org/2012/ldp/hg/tests/reports/ldp.html on 19 July 2017
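To make the containment model described above concrete, the following minimal Python sketch (an in-memory stand-in, not a conformant LDP implementation; all IRIs are illustrative) shows how an LDP Basic Container lists its members through ldp:contains triples:

```python
# Minimal sketch of the LDP containment model: a Basic Container is an
# LDPR whose RDF representation lists its members via ldp:contains.
# All IRIs are illustrative.

LDP = "http://www.w3.org/ns/ldp#"

class BasicContainer:
    """In-memory stand-in for an LDP Basic Container."""

    def __init__(self, iri):
        self.iri = iri
        self.members = []  # IRIs of contained LDPRs

    def add_member(self, member_iri):
        # In a real LDP, membership is created by an HTTP POST to the container.
        self.members.append(member_iri)

    def to_triples(self):
        # RDF representation: one type triple plus one ldp:contains per member.
        triples = [(self.iri, "rdf:type", LDP + "BasicContainer")]
        triples += [(self.iri, LDP + "contains", m) for m in self.members]
        return triples

c = BasicContainer("http://example.org/sensors/")
c.add_member("http://example.org/sensors/s1")
print(len(c.to_triples()))  # type triple + 1 containment triple -> 2
```

In an actual LDP, the membership triples are managed by the server in response to HTTP requests rather than through direct method calls; the sketch only illustrates the data model.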
To do so, the user has to take all the design decisions related to linked data publication, such as the IRI definitions for RDF Web documents, the structure of these documents and their content definition, and write custom data adapters for generating LDPRs to materialize them in an LDP. Based on our analysis, we believe these solutions suffer from all the problems mentioned in Sec. 1.

4 Research Questions and Hypotheses

From the problems and the analysis of the state of the art, the main research question of our work is whether it is possible to automatically deploy an LDP. This question in turn generates many other questions, centered around two main aspects: design (RQ1, RQ2) and deployment (RQ3, RQ4) of LDPs. The research questions and their hypotheses are:

RQ1 How to decouple the design from the implementation of LDPs?
– We hypothesize that the design of LDPs can be described declaratively in a separate document using a dedicated vocabulary.
RQ2 How to generate the design from data sources?
– We hypothesize that a default design document for an LDP can be generated using the ontologies associated with the data sources.
RQ3 How to automate the deployment of an LDP?
– We hypothesize that the deployment of an LDP can be automated using its design document.
RQ4 How to consider hosting constraints when deploying an LDP?
– We hypothesize that hosting constraints can be considered by generating LDP resources on the fly using the design document on the LDP.

5 Proposed Approach

To answer our research questions, we provide an approach which uses principles from model-driven engineering (MDE). MDE provides general principles for addressing the problems mentioned in Sec. 1.
In summary, it involves using models as first-class entities and transforming them into running systems by using generators or by dynamically executing the models at run-time [9, 10]. By separating models from the running systems, MDE enables separation of concerns, thus guaranteeing higher maintainability of software systems and reusability of systems' models [14]. Therefore, in our approach, we intend to apply MDE principles in the context of LDPs to automate their design and deployment.

6 http://callimachusproject.org on 15 July 2017
7 http://carbonldp.com on 15 July 2017
8 http://fedora-commons.org on 15 July 2017
9 http://marmotta.apache.org on 15 July 2017
10 http://www.openlinksw.com/ on 15 July 2017
11 https://github.com/linkeddata/gold on 15 July 2017
12 https://github.com/read-write-web/rww-play on 15 July 2017
13 http://wiki.eclipse.org/Lyo/LDPImpl on 15 July 2017

Figure 1 gives a general overview of our approach. The circled numbers in the text below refer to the corresponding circles in the figure. Each circled number is a component; the components are described below and relate to the research questions of Sec. 4: (1) relates to RQ1, (2) to RQ2, (3) and (4) to RQ3, and finally (3) and (5) to RQ4.

[Fig. 1. General Overview of our Approach — diagram showing the Design Phase (components 1 and 2: generation of the design document from an RDF graph, an ontology and parameters) and the Deployment Phase (component 3: Generate LDPRs; component 4: Physical Deployment; component 5: Virtual Deployment over a SPARQL endpoint), with data and control flows between the platform creator, the user and the LDP.]

5.1 Design Phase Support

The aim of the design phase is to produce the design document (1). To decouple the design from the implementation of an LDP, we will provide a vocabulary to declaratively describe the design in the design document separately from the implementation.
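To fix ideas, a design document expressed against such a vocabulary might resemble the following sketch. The terms ("containers", "iriTemplate", "memberQuery"), the IRI template and the query are hypothetical, invented here purely for illustration, and JSON is used only for brevity; the actual vocabulary and its serialization are precisely the subject of RQ1:

```python
import json

# Hypothetical design document: every vocabulary term below is an
# illustrative assumption, not the vocabulary the paper will define.
DESIGN_DOCUMENT = """
{
  "containers": [
    {
      "iriTemplate": "/parking/{id}",
      "type": "BasicContainer",
      "memberQuery": "SELECT ?m WHERE { ?m a ex:ParkingArea }"
    }
  ]
}
"""

# A deployer (cf. RQ3) would read such a document instead of relying on
# hard-coded logic, decoupling the design decisions from the implementation.
design = json.loads(DESIGN_DOCUMENT)
print(design["containers"][0]["type"])  # -> BasicContainer
```

The point of the sketch is the decoupling: changing the IRI scheme or the member selection amounts to editing the document, not the platform code.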
To further facilitate the creation of this document, we will provide a method (2) to automatically produce it from existing data sources. In the next two paragraphs, we describe the vocabulary and (2).

The vocabulary will be formalized using an ontology language, such as RDFS or OWL, based on the required level of expressivity. The terms of the vocabulary will correspond to design aspects and their respective design choices. To set up this vocabulary, we will first identify the design aspects of an LDP. Design aspects can be identified from existing linked data platforms and also from standards and best practices. For instance, the following elements issued from the LDP standard are good starting points: organization of LDP resources in terms of LDP containers, LDP container types, content definition of LDP containers and their members, and IRI structure of LDP containers and their members. After the design aspects have been identified, a set of design choices has to be established for each design aspect. For some, established design choices already exist. For others, no established design patterns exist (e.g. to characterize the content of an LDP container or its members). To this aim, we intend to perform experiments on currently deployed Linked Data to analyze current trends, derive design patterns and set up design choices based on them.

The core aspect of (2) is to generate the design document by automating generic design decisions. An example of a generic decision is to use the class hierarchy of the ontology to generate LDP containers. All generic design decisions made will be based on design patterns derived from our experiments or analysis. We intend to make (2) parameterizable to provide some level of customization and allow the user to have some control over the generated design document. (2)
will generate the design document using the metamodel of the data and possibly some parameters provided by the platform creator. Regarding the metamodel, we consider two types of data: RDF data and non-RDF data (CSV, JSON, XML, etc.). For RDF data, the metamodel can either be an explicit ontology or the ontology implicit in the data. For non-RDF data, the metamodel is the ontology used for RDFizing the data. Although we do not explicitly address data heterogeneity, we will use existing works in this area such as GRDDL [4], R2RML [5], RML [6] and SPARQL-Generate [11].

5.2 Deployment Phase Support

We consider two strategies for supporting the deployment of LDPs: Physical Deployment (4) and Virtual Deployment (5). The process Generate LDPR (3) is based on a method, which we will introduce, that is common to both strategies. (3) uses the design document, extracts the appropriate design decisions made for LDPRs and generates their content. In (4), (3) is used to generate all the LDPRs and materialize them in one go in an LDP. In (5), on obtaining a request for an LDPR, there is a validation process after which the LDPR is generated using (3). In contrast to (4), in (5) the LDPRs cannot be materialized. Instead, data resources remain in their external data sources in their native state. Serving them as LDPRs requires querying these data sources and applying the required transformations to generate them. As such, one potential problem with (5) is the extra time taken to generate the reply compared to (4). One reason why this may occur is having a dynamic controller which exploits the design document every time a request is obtained to validate and process it. A possible solution may be to have different types of controllers, namely:

– Semi-dynamic controllers: Dedicated controllers are generated for different categories of LDPRs.
The required segments of the design document are encoded in the generated dedicated controllers to avoid consulting the design document when processing every request. The controllers may be re-generated if the design document changes.
– Dynamic controllers: Only one generic controller is used, and the design document is consulted when processing every request.

Components (4) and (5) are our two main strategies for supporting deployment. We also intend to consider hybrid deployment strategies using both (4) and (5), where, for example, some LDPRs are materialized in data stores while others need to be generated on the fly at query time.

6 Evaluation Plan

We intend to make the evaluation more robust as the research matures, also using feedback from this doctoral consortium. As of now, our evaluation is structured as follows:

1. Flexibility: Show the flexibility of our approach by performing experiments in different concrete scenarios using real datasets. These experiments will also answer our research questions. We intend to:
(a) use a similar design for LDPs having different data sources, and use different designs for LDPs with similar data sources (cf. RQ1);
(b) use linked data to build a design document, where modular parts of other design documents are combined to obtain the final design document (cf. RQ1);
(c) directly use design documents generated from RDF datasets for automatically deploying them as LDPs (cf. RQ2, RQ3);
(d) deploy LDPs on top of dynamic smart city data sources having hosting constraints (cf. RQ4).
2. Quality: Evaluate the automated design with respect to the quality of the Linked Data it generates, using an evaluation framework which we will introduce based on the Data on the Web Best Practices [12]. For the best practices from [12], we will set up metrics wherever possible, as some best practices are purely qualitative;
3.
Performance: Evaluate the performance of Physical Deployment against Virtual Deployment. This will also involve evaluating semi-dynamic controllers against dynamic controllers in Virtual Deployment, to provide insight into the optimization that is possible using semi-dynamic controllers.

Finally, since we are using an MDE approach, we intend to investigate the state of the art for the evaluation of such approaches and apply it, wherever possible, to our approach.

7 Reflections

Some of the work performed in our PhD has been a good starting point for understanding the context of the work discussed in this paper. We developed from scratch SCANS [2], a linked data platform for smart city artifacts. Doing so has broadened our knowledge of the design and deployment of linked data platforms and enabled us to see the different problems which can arise when developing them. We experimented with data platform solutions and found that there are tasks which are repetitive and can be automated. We also found that vocabularies are already in use (cf. Sec. 3) to describe some minimal design aspects, which shows that it is possible to generate a data platform using a design description. While writing custom data adapters to generate data resources, we found that they can be generalized and their changing parameters externalized, hence making room to consider hosting constraints. We have also seen patterns both in the IRIs and in the content of linked data resources. For example, we have noticed that the content of linked data resources includes those triples from the original RDF graph where the resource is either the subject or the object. This pattern can be formalized as a design choice. More such patterns exist in currently deployed Linked Data, and we will perform experiments to find relevant ones for our work.
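The pattern just mentioned can be formalized as a candidate design choice; the following minimal Python sketch makes it explicit (triples are plain tuples and the toy graph, prefixes and IRIs are illustrative):

```python
# Candidate design choice: the content of a linked data resource r is the
# set of triples of the source graph where r appears as subject or object.
# The graph below is a toy example.

graph = [
    ("ex:stop1", "rdf:type", "ex:BusStop"),
    ("ex:stop1", "ex:name", "Town Hall"),
    ("ex:line4", "ex:serves", "ex:stop1"),
    ("ex:line4", "rdf:type", "ex:BusLine"),
]

def resource_content(graph, resource):
    """Select the triples forming the content of `resource`."""
    return [t for t in graph if t[0] == resource or t[2] == resource]

print(len(resource_content(graph, "ex:stop1")))  # 3 triples mention ex:stop1
```

Other design choices (e.g. subject triples only, as mentioned in Sec. 1) would simply swap the selection predicate, which is what makes such patterns good candidates for a declarative design vocabulary.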
Moreover, after going through the different best practices provided by [12], we believe that our approach will be able to conform to many of them, thus guaranteeing the publication of data conforming to best practices. Based on the concrete observations mentioned above, we believe that our approach can be successful.

Acknowledgments

This work is supported by ANR grant 14-CE24-0029 for project OpenSensingCity. The author would also like to acknowledge the thesis supervisors, Antoine Zimmermann and Olivier Boissier.

References

1. S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, and D. Aumueller. Triplify: lightweight linked data publication from relational databases. In Proceedings of the 18th international conference on WWW. ACM, 2009.
2. Noorani Bakerally, Olivier Boissier, and Antoine Zimmermann. Smart city artifacts web portal. In ESWC. Springer, 2016.
3. Christian Bizer and Richard Cyganiak. D2R Server: publishing relational databases on the semantic web. In Poster at the 5th ISWC, volume 175, 2006.
4. Dan Connolly et al. Gleaning resource descriptions from dialects of languages (GRDDL). W3C Recommendation REC-grddl-20070911, 2007.
5. Souripriya Das, Seema Sundara, and Richard Cyganiak. R2RML: RDB to RDF mapping language. W3C Recommendation, 27 September 2012.
6. A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and Rik Van de Walle. RML: A generic language for integrated RDF mappings of heterogeneous data. 2014.
7. M. Esteban-Gutiérrez, N. Mihindukulasooriya, and R. García-Castro. Interoperable read-write linked data application development with the LDP4j framework.
8. M. Esteban-Gutiérrez, N. Mihindukulasooriya, and R. García-Castro. LDP4j: A framework for the development of interoperable read-write linked data applications. 2014.
9. Robert France and Bernhard Rumpe. Model-driven development of complex software: A research roadmap. In 2007 FSE. IEEE Computer Society, 2007.
10. Jochen Küster.
Model-driven software engineering. IBM Research, 2011.
11. M. Lefrançois, A. Zimmermann, and N. Bakerally. A SPARQL extension for generating RDF from heterogeneous formats. In 14th ESWC, 2017.
12. Bernadette Farias Lóscio, Caroline Burle, and Newton Calegari. Data on the Web Best Practices. W3C Working Draft, January 2016.
13. Steve Speicher, John Arwe, and Ashok Malhotra. Linked Data Platform 1.0. W3C Recommendation, 26 February 2015.
14. Thomas Stahl and Markus Völter. Model-driven software development: technology, engineering, management. J. Wiley & Sons, 2006.