<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>STATisfy Me: What are my Stats?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gezim Sejdiu</string-name>
          <email>sejdiu@cs.uni-bonn.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Ermilov</string-name>
          <email>iermilov@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohamed Nadjib Mami</string-name>
          <email>mami@cs.uni-bonn.de</email>
          <email>mohamed.nadjib.mami@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <email>jens.lehmann@cs.uni-bonn.de</email>
          <email>jens.lehmann@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Leipzig</institution>
          ,
          <addr-line>04109 Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fraunhofer IAIS</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Smart Data Analytics, University of Bonn</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The increasing adoption of the Linked Data format, RDF, over the last two decades has brought new opportunities. It has also raised new challenges, though, especially when it comes to managing and processing large amounts of RDF data. In particular, assessing the internal structure of a data set is important, since it enables users to understand the data better. One prominent way of assessment is computing statistics about the instances and schema of a data set. However, computing statistics of large RDF data is computationally expensive. To overcome this challenge, we previously built DistLODStats, a framework for parallel calculation of 32 statistical criteria over large RDF datasets, based on Apache Spark. Running DistLODStats is thus done by submitting jobs to a Spark cluster. Often, this process is done manually, either by connecting to the cluster machine or via a dedicated resource manager. This approach is inconvenient, as it requires acquiring new software skills as well as direct interaction of users with the cluster. To make DistLODStats easier to use, we propose in this paper an approach for triggering RDF statistics calculation remotely, simply using HTTP requests. DistLODStats is built as a plugin into the larger SANSA Framework and makes use of Apache Livy, a novel lightweight solution for interacting with a Spark cluster via a REST interface.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        SANSA [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is an open source framework<sup>4</sup> that allows RDF processing at scale. It
provides a set of libraries for executing SPARQL queries and for performing inference
and analytics over knowledge graphs, while supporting several RDF representations.
In addition, it provides support for RDF dataset statistics and quality assessment for
large-scale RDF datasets. The statistics are calculated using the dedicated
component DistLODStats [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a distributed and scalable software component able to compute 32
statistical criteria (initially proposed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
      </p>
      <p>SANSA and DistLODStats use Apache Spark<sup>5</sup> as the underlying engine, which is a
popular framework for processing large datasets in memory. Spark provides two
ways of running and interacting with applications:</p>
      <list list-type="bullet">
        <list-item>
          <p>
            Interactive: via a command line interface (CLI) called Spark Shell, or via Spark
            Notebooks (e.g. SANSA-Notebooks [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]).
          </p>
        </list-item>
        <list-item>
          <p>Batch: via a bash script called spark-submit, which is used to submit a Spark
application to the cluster without interaction during run time.</p>
        </list-item>
      </list>
      <p><sup>4</sup> https://github.com/SANSA-Stack</p>
      <p><sup>5</sup> http://spark.apache.org/</p>
      <p>A Spark application is usually launched by first logging into a cluster, either on
premises or remotely in the cloud. This process presents several difficulties:</p>
      <list list-type="bullet">
        <list-item>
          <p>It requires sophisticated user access control management, which may become
hard to maintain with multiple users.</p>
        </list-item>
        <list-item>
          <p>It raises the chances of exhausting the cluster or even causing its failure.</p>
        </list-item>
        <list-item>
          <p>It exposes the cluster and its configuration to all users with access.</p>
        </list-item>
      </list>
      <p>To alleviate these difficulties, we have investigated Apache Livy<sup>6</sup>, a novel open source
REST interface for interacting remotely with Apache Spark. It supports executing
snippets of code or programs in a Spark context that runs locally, in a Spark cluster, or on
Apache Hadoop YARN.</p>
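      <p>As a rough illustration of what executing a snippet of code through Livy involves, the following Python sketch prepares the two HTTP requests of an interactive session: one that opens a session and one that sends a code snippet to it. The server address http://localhost:8998 (Livy's default port), the session id 0 and the Scala snippet are assumptions made for illustration; the endpoints POST /sessions and POST /sessions/{id}/statements belong to the Livy REST API. The requests are only constructed here, not sent.</p>
      <preformat><![CDATA[
```python
import json
import urllib.request

# Assumed address of a Livy server; 8998 is Livy's default port.
LIVY_URL = "http://localhost:8998"


def json_post(path, payload):
    """Build (but do not send) a JSON POST request against the Livy server."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# 1) Open an interactive session that accepts Scala code.
open_session = json_post("/sessions", {"kind": "spark"})

# 2) Send a code snippet to that session. Session id 0 is a placeholder;
#    the real id is returned in the response to the first request.
run_snippet = json_post(
    "/sessions/0/statements",
    {"code": "sc.parallelize(1 to 10).sum()"},
)

if __name__ == "__main__":
    for request in (open_session, run_snippet):
        print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```
]]></preformat>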
      <p>
        This is an accompanying poster paper for DistLODStats [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which was accepted
at the ISWC resource track. The addition made in this poster is an interactive REST
API for DistLODStats, which enables calculating RDF dataset statistics remotely, i.e.,
without direct contact with the hosting cluster.
      </p>
    </sec>
    <sec id="sec-2">
      <title>STATisfy: A REST Interface for DistLODStats</title>
      <p>Traditionally, a Spark job is submitted to a Spark cluster via
spark-shell or spark-submit. Usually, this is done manually, either by logging into the
cluster gateway machines or via a dedicated resource manager (e.g. SLURM,
OpenStack).</p>
      <p><sup>6</sup> https://livy.incubator.apache.org/</p>
      <p>For users with little experience in cluster management and the Hadoop
infrastructure, it can be challenging to run Spark. As an alternative, we introduce STATisfy<sup>7</sup>,
a REST interface for DistLODStats. Instead of computing RDF statistics directly on the
cluster, the interaction is done via REST APIs (as depicted in Figure 1).</p>
      <p>The client side creates a remote Spark session for initialization and submits jobs
through the REST API. The Livy REST server then picks up each job and forwards it via
remote procedure call (RPC) to the SparkSession, where the code is initialized and
executed. In the meantime, the client waits for the result of the job, which is returned
along the same path.</p>
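      <p>The waiting step described above can be sketched as a simple polling loop. This is a minimal sketch under stated assumptions, not the STATisfy implementation: the helper names wait_for_batch and fetch_state_http, the server address and the polling interval are hypothetical. The endpoint GET /batches/{id}/state is part of the Livy REST API, and the set of terminal states is assumed from the state names Livy documents. The state lookup is passed in as a function so that the loop can be tried without a running cluster.</p>
      <preformat><![CDATA[
```python
import json
import time
import urllib.request

# States after which the job is assumed not to change any more
# (an assumption based on the state names Livy documents).
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def fetch_state_http(batch_id, livy_url="http://localhost:8998"):
    """Ask an (assumed) Livy server for the current state of one batch."""
    with urllib.request.urlopen(f"{livy_url}/batches/{batch_id}/state") as response:
        return json.load(response)["state"]


def wait_for_batch(batch_id, fetch_state=fetch_state_http, interval=5.0, max_polls=120):
    """Poll until the batch reaches a terminal state, then return that state."""
    for _ in range(max_polls):
        state = fetch_state(batch_id)
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval)
    raise TimeoutError(f"batch {batch_id} did not finish after {max_polls} polls")


if __name__ == "__main__":
    # Offline demonstration with a fake state sequence instead of a real server.
    fake_states = iter(["starting", "running", "running", "success"])
    print(wait_for_batch(7, fetch_state=lambda _id: next(fake_states), interval=0.0))
```
]]></preformat>
      <p>Bounding the number of polls keeps a client from waiting forever on a job that never reaches a terminal state; the limit and interval shown are arbitrary.</p>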
      <p>Running STATisfy is similar to using DistLODStats via spark-submit. The
difference is that the shell does not run locally; instead, it runs in the cluster and transfers
data back and forth over the network.</p>
      <p>To demonstrate the usage of the tool, we have deployed it on the comprehensive
statistics catalogue LODStats<sup>8</sup>, which crawls RDF data from metadata portals such as
the CKAN dataset metadata registry. By doing this, it obtains a comprehensive picture of
the current state of the Web of Data. Since we use DistLODStats as the underlying engine
for computing the RDF statistics afterwards, the limitation was that the user had to interact
with the cluster manually to initiate the job for computing these statistics. With the
STATisfy REST interface, LODStats can interact with the cluster from anywhere,
without compromising on ease of use or
security.</p>
      <p><sup>7</sup> https://github.com/GezimSejdiu/STATisfy</p>
      <p><sup>8</sup> http://lodstats.aksw.org/</p>
      <p>As shown in Figure 2, the user starts a session via the Livy REST API in order to
submit a job to the Spark cluster.
The script (see Listing 1.1) contains the spark-submit configuration, given as a
JSON structure that carries the same information spark-submit would need. With the
request POST /batches, the user submits a job for DistLODStats to the Livy
server. Using Livy, STATisfy then launches this request in the cluster. The
output can then be retrieved on the user's end (e.g. with curl) as a VoID description.</p>
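      <p>The following Python sketch illustrates the kind of JSON structure described above; it is not Listing 1.1 itself. The jar location, class name, and input and output paths are hypothetical placeholders rather than values from the STATisfy repository, and the server address assumes Livy's default port. The field names file, className, args, executorMemory and numExecutors are fields of the Livy batch API. The request is built but not sent.</p>
      <preformat><![CDATA[
```python
import json
import urllib.request

LIVY_URL = "http://localhost:8998"  # assumed Livy endpoint (default port)


def build_batch_request(jar, main_class, args, executor_memory="4g", num_executors=2):
    """Build (without sending) a POST /batches request for a Livy server.

    The body mirrors what one would otherwise pass to spark-submit.
    """
    payload = {
        "file": jar,                        # application jar (positional in spark-submit)
        "className": main_class,            # like --class
        "args": list(args),                 # application arguments
        "executorMemory": executor_memory,  # like --executor-memory
        "numExecutors": num_executors,      # like --num-executors
    }
    return urllib.request.Request(
        url=LIVY_URL + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# All values below are hypothetical placeholders.
request = build_batch_request(
    jar="hdfs:///apps/distlodstats-assembly.jar",
    main_class="org.example.DistLODStatsJob",
    args=["--input", "hdfs:///data/dataset.nt", "--output", "hdfs:///data/dataset.void"],
)

if __name__ == "__main__":
    print(request.get_method(), request.full_url)
    print(json.dumps(json.loads(request.data), indent=2))
    # To actually submit, one would pass the request to urllib.request.urlopen
    # against a reachable Livy server.
```
]]></preformat>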
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>In order to deepen their understanding of the data, many users need to gather
statistical information about RDF datasets. This process becomes compute-intensive as the
datasets grow in size. DistLODStats is a prominent solution; however, it requires setting up
and managing the cluster configuration and job submission. To make the
process easier, we have introduced STATisfy, a tool for interacting with DistLODStats via
a REST interface. This way, DistLODStats can be provided as a service, where users
only send HTTP requests to the remote cluster and obtain the desired results, without
needing any knowledge about system access or cluster management. STATisfy is used
in the LODStats project, and its inclusion in the new DBpedia community release
process is ongoing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>J.</given-names>
            <surname>Demter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>LODStats - an extensible framework for high-performance dataset analytics</article-title>
          .
          <source>In Proceedings of the EKAW 2012, Lecture Notes in Computer Science (LNCS) 7603</source>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>I.</given-names>
            <surname>Ermilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sejdiu</surname>
          </string-name>
          , L. Bühmann, P. Westphal,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Petzka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngonga</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Jabeen</surname>
          </string-name>
          .
          <article-title>The Tale of Sansa Spark</article-title>
          .
          <source>In Proceedings of 16th International Semantic Web Conference, Poster &amp; Demos</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sejdiu</surname>
          </string-name>
          , L. Bühmann, P. Westphal,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          , I. Ermilov,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngonga</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Jabeen</surname>
          </string-name>
          .
          <article-title>Distributed Semantic Analytics using the SANSA Stack</article-title>
          .
          <source>In Proceedings of 16th International Semantic Web Conference</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>G.</given-names>
            <surname>Sejdiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ermilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Nadjib-Mami</surname>
          </string-name>
          .
          <article-title>DistLODStats: Distributed Computation of RDF Dataset Statistics</article-title>
          .
          <source>In Proceedings of 17th International Semantic Web Conference</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>