<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Analytics in the Palm of your Browser</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carsten Felix Draschner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Farshad Bakhshandegan Moghaddam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hajira Jabeen</string-name>
          <email>hajira.jabeen@uni-koeln.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bonn</institution>
          ,
          <addr-line>Bonn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Cologne</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Linked open data sources and the Semantic Web have become precious resources for data analytics and data integration tasks. The growing sizes of RDF Knowledge Graph datasets call for scalable processing and analytics techniques. In-memory frameworks for scalable distributed semantic analytics such as SANSA make use of Apache Spark and Apache Jena to provide extensive start-to-end scalable analytics on RDF knowledge graphs. However, setting up such a technical system with all its dependencies and environments can be a tough challenge and might also require sufficient available processing power. To reduce the entry barriers to evaluating and testing all the opportunities of the SANSA framework, and even to bring this technology to production, only a browser is needed. In this paper, we introduce how to get the SANSA stack running within Databricks, with no need for special Apache Spark skills or any installation. This simplified usage offers distributed large-scale processing of RDF data even from mobile devices. In addition, the availability of hands-on sample notebooks increases the reproducibility of complex framework evaluation experiments. This paper shows that starting up a complex, scalable semantic data analytics stack does not need to be complicated.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Analytics</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Semantic Similarity</kwd>
        <kwd>Distributed Processing</kwd>
        <kwd>Apache Spark</kwd>
        <kwd>Resource Description Framework</kwd>
        <kwd>RDF</kwd>
        <kwd>SANSA</kwd>
        <kwd>Databricks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>An increasing number of datasets based on the linked open data principles have
appeared in recent years. These offer tremendous potential for data integration
through the use of IRIs and URIs. Moreover, in the area of energy data,
semantic data is emerging in many projects and is being used for various data
analytics tasks. Due to the large amount of data, solutions from the area of big
data processing are necessary.</p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Big data technologies such as Apache Spark3 enable fast in-memory and
parallel processing of data that scales arbitrarily across parallel cores through
optimization for distributed cluster computing. The Scalable Semantic Analytics Stack
SANSA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] leverages Apache Spark and Apache Jena4 to provide an open-source
framework for start-to-end data analytic pipelines over large-scale RDF
Knowledge Graphs [
        <xref ref-type="bibr" rid="ref10 ref11 ref7">7,10,11</xref>
        ].
      </p>
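      <p>To illustrate the setup burden that this paper aims to remove, the following sketch shows the kind of Spark session boilerplate a manual, local SANSA installation would require. This is a hedged, minimal sketch, not SANSA's prescribed setup: it assumes a SANSA jar is already on the classpath, and the Kryo registrator class names follow those used in recent SANSA releases and may differ between versions.</p>

```scala
import org.apache.spark.sql.SparkSession

// Minimal local Spark session configured for SANSA.
// The registrator class names below are assumptions based on recent
// SANSA releases; check the release notes of the version you use.
val spark = SparkSession.builder()
  .appName("SANSA local tryout")
  .master("local[*]") // use all local cores
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrator",
    "net.sansa_stack.rdf.spark.io.JenaKryoRegistrator," +
    "net.sansa_stack.query.spark.sparqlify.KryoRegistratorSparqlify")
  .getOrCreate()
```

      <p>On Databricks, this session is created by the platform, so none of this code needs to be written by hand.</p>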
      <p>Distributed frameworks like the SANSA stack have vast potential in the
energy domain for processing large-scale Knowledge Graphs. However, the technical
requirements of the initial setup and the first experimentation pose challenging and
overwhelming entry hurdles for less experienced data scientists and machine learning
engineers developing first experiments, proofs of concept, or
Minimum Viable Products (MVPs).</p>
      <p>To reduce this hurdle, we show how users from the energy sector, and from
all other fields who want to build large-scale RDF Knowledge Graph data
analytic pipelines, can use SANSA with minimal technical requirements and a low entry
hurdle.</p>
      <p>The main contributions of this paper are the following:
– Introduction of the Scalable Semantic Analytics Stack in the browser through Databricks.
– Sample explanatory notebooks for hands-on interaction with RDF data.
– A guideline for using third-party Apache Spark frameworks within
Platform as a Service (PaaS) providers.
– Showcasing recent machine learning modules and developments of the SANSA
stack.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In recent years, it has been recognized that creating complex technical
environments is a major challenge, and virtualization environments have therefore been
developed. On the one hand, there are virtualization environments that run
an entire image of a complete operating system, such as VirtualBox5, Parallels6, and VMware7;
on the other hand, there are containerization platforms like Docker [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
orchestrated by Swarm8 or Kubernetes9, whose architecture
enables clean replication and scaling of complex technical dependencies. In data
science, libraries like pipenv10 or poetry11 are popular, enabling a project-level
or repository-level encapsulated environment.
3 https://spark.apache.org
4 https://jena.apache.org
5 https://www.virtualbox.org
6 https://www.parallels.com/
7 https://www.vmware.com/
8 https://docs.docker.com/engine/swarm/
9 https://kubernetes.io
10 https://github.com/pypa/pipenv
11 https://python-poetry.org
      </p>
      <p>
        Reproducing and illustrating machine learning pipelines is increasingly
enabled by formats such as markdown documents or, more preferably, notebooks.
Both have the advantage of representing a variety of content in the same
document: there can be text and graphical sections as in classical literature, in
addition to code sections. In the same notebook, the results of the code cells
can be rendered to show, for example, a generated data frame,
a figure, or an arbitrary plot. Popular examples of notebooks are Jupyter
Notebook [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], JupyterLab12 and Zeppelin Notebooks [
        <xref ref-type="bibr" rid="ref12 ref4">4,12</xref>
        ].
      </p>
      <p>The idea of collaborative editing of resources within the browser was first
introduced by office solutions such as Google Docs13, Nextcloud Office14,
Overleaf15, and many more. This opportunity for a group of users to share and edit
documents in parallel at the same time was adopted by notebook-based
programming platforms like Google Colab16 and Databricks17. The processing power
is provided by the platform providers. The focus of Google Colab is the default
Python data science stack, while Databricks focuses on Spark processing. These
platforms offer a free plan with limited processing power and functionality,
sufficient for most first hands-on notebooks that demonstrate the functionality and usage
of libraries and frameworks.</p>
    </sec>
    <sec id="sec-3">
      <title>SANSA through Databricks</title>
      <p>A complex and heterogeneous framework like SANSA18 requires several technical
prerequisites to run initial experiments. On the one hand, the computation is
done in memory, so it is crucial to have enough memory to manage the data;
on the other hand, the computation is done on the CPU side, and Apache Spark is
designed for multi-core and cluster computation. In order to use the framework,
Apache Spark must be available in the required version (in our case 3.x) and
Scala in version 2.12. Setting up this hardware and software can be eased
by using Databricks since, even in the Community Edition (free plan), a
two-core system with 15GB of memory is already available. Furthermore, there are
predefined images for combinations of different Apache Spark and
Scala versions. The following sections will guide through the setup and explain
working with SANSA on RDF data.</p>
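      <p>Once a runtime is available, these version requirements can be checked from any notebook cell. The following is a minimal sketch; it assumes the SparkSession object named spark that Databricks pre-binds in every notebook.</p>

```scala
// Sanity-check that the runtime matches SANSA's requirements
// (Apache Spark 3.x and Scala 2.12). `spark` is the SparkSession
// that Databricks provides automatically in notebooks.
println(s"Spark version: ${spark.version}")
println(s"Scala version: ${scala.util.Properties.versionNumberString}")
```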
      <sec id="sec-3-1">
        <title>Get Access to Platform</title>
        <p>
          Databricks [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is one of several Platform as a Service (PaaS) providers. Few
alternatives offer the same simplicity of setting up an Apache Spark instance
and making it accessible through notebooks in a
user-friendly way. For registering with Databricks on the free plan, the Community Edition
is suitable. More information can be found in the Databricks FAQ19.
12 https://jupyterlab.readthedocs.io
13 https://www.google.com/docs/about/
14 https://nextcloud.com/onlyoffice/
15 https://www.overleaf.com/
16 https://colab.research.google.com
17 https://databricks.com
18 https://github.com/SANSA-Stack/SANSA-Stack
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Upload needed Data</title>
        <p>
          Once logged into the platform, one can import libraries. The
SANSA stack needs to be uploaded as a jar, which can be fetched from the
most recent release on the SANSA stack GitHub page20. The library name is assigned
automatically from the filename of the jar. Due to the jar's size, the upload
process will take a few minutes. After the upload is done, the process can be
confirmed with the Create button. Next, we need to make our desired data available by
adding the Knowledge Graph data to the Databricks file system.
We introduce the usage of the SANSA stack based on the Linked Movie Database
dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], a LOD RDF dataset describing 40 thousand movies with properties such as
title, runtime, list of actors, genres, and publication date. This dataset
represents a multimodal Knowledge Graph suitable for several example pipelines. The import
can be started from the main page's Import and Explore Data option. In the overlay menu,
one can drag and drop the file. Other datasets can be found on web pages like
https://lod-cloud.net or https://www.w3.org/wiki/DataSetRDFDumps. Once the
data is uploaded, the menu shows the path where it got stored, for example:
"FileStore/tables/linkedmdb-18-05-2009-dump.nt".
        </p>
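        <p>After the upload, the dataset can be read from that path with SANSA's RDF layer. The following minimal sketch assumes the implicit RDF readers from the net.sansa_stack.rdf.spark.io package provided by recent SANSA releases; exact method names may differ between versions.</p>

```scala
import net.sansa_stack.rdf.spark.io._
import org.apache.jena.riot.Lang

// Path as reported by the Databricks upload dialog, prefixed for DBFS.
val path = "dbfs:/FileStore/tables/linkedmdb-18-05-2009-dump.nt"

// Read the N-Triples dump into a distributed collection of Jena triples.
val triples = spark.rdf(Lang.NTRIPLES)(path)
println(s"Number of triples: ${triples.count()}")
```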
      </sec>
      <sec id="sec-3-3">
        <title>Setup Cluster</title>
        <p>One must set up a cluster in the platform to execute the notebooks. First,
create a new cluster and give it a unique name like
SANSA-tryout-cluster. Next, select the Spark Runtime Version image matching the pair of
Scala 2.12 and Apache Spark 3.x. Then specify the Spark config by pasting the
three key-value pairs shown in figure 1 and figure 2; they correspond to
the default Databricks and SANSA Spark setup. The cluster configuration has
to be confirmed with Create Cluster, which opens
the overview of the newly created cluster. In the Libraries tab, the previously
uploaded SANSA jar, found within the user's workspace, needs to be installed.
This process has to be confirmed with Install. After some
seconds, the SANSA library will change status from installing to installed.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Setup Notebook</title>
        <p>Now, we have to open or create a notebook. One can start
with a blank notebook, but it is easier to use the provided sample notebooks21.
These sample notebooks can be imported by using the import option from the user's
workspace. In the pop-up window, one can import the notebook via the notebook
URL. The import will directly add the notebook to the workspace and
open it up.
19 https://databricks.com/product/faq
20 https://github.com/SANSA-Stack/SANSA-Stack/releases
21 https://github.com/SANSA-Stack/SANSA-Databricks</p>
        <p>spark.databricks.delta.preview.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator net.sansa_stack.rdf.spark.io.JenaKryoRegistrator, net.sansa_stack.query.spark.sparqlify.KryoRegistratorSparqlify</p>
      </sec>
      <sec id="sec-3-5">
        <title>Execution of Sample Notebooks</title>
        <p>
          The notebook needs to be assigned a cluster. The cluster should be present
as previously configured (see figure 5) and contain the SANSA framework as
a library. After selecting the cluster, it gets attached and will be ready after
some seconds. This enables the execution of notebook cells with SANSA module
functionalities.
The provided notebooks show how to read RDF Knowledge
Graphs, how to query data over SPARQL, and how to execute elements from the ML
layer. Many more generic modules for designing the desired
start-to-end Apache Spark/SANSA pipeline for RDF Knowledge Graph analytics and
processing can be found in the SANSA documentation [
          <xref ref-type="bibr" rid="ref10 ref11">10,11</xref>
          ]. One of the recent examples
is the DistSim [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] approach, which calculates similarity scores for RDF entities
that can then be used for various follow-up approaches like clustering, entity
linking, classification, or recommendation systems [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. A complete tutorial,
including links to sample notebooks, can be found in an uploaded presentation
and within the corresponding GitHub repository [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Two sample notebooks can
be found directly here:
– SANSA DistSim Sample Databricks Notebook [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
– SANSA DistRDF2ML Regression Sample Databricks Notebook [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
        </p>
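        <p>As a taste of what the sample notebooks contain, a SPARQL query over previously loaded triples can be sketched as follows. This is a hedged sketch: it assumes the implicit sparql extension from SANSA's query layer and a hypothetical title predicate for the LinkedMDB data; both should be checked against the release actually installed.</p>

```scala
import net.sansa_stack.query.spark.query._

// Hypothetical example query: list ten movies and their titles.
// The predicate IRI is illustrative and may differ in the actual dump.
val query = """
  |SELECT ?movie ?title WHERE {
  |  ?movie <http://purl.org/dc/terms/title> ?title .
  |} LIMIT 10""".stripMargin

// `triples` is assumed to hold the triples read earlier via SANSA's RDF layer.
val result = triples.sparql(query)
result.show()
```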
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>This paper demonstrates that a complex and holistic framework for scalable
semantic analytics can be made easily accessible, by showcasing sample notebooks
hosted and running within the Platform as a Service provider Databricks. This
guideline offers users the opportunity to take the first steps in exploring and
porting their semantic data analytical pipeline ideas. On the one hand, a
hardware setup with appropriate computational power and main
memory is not needed for the first steps because the notebooks run on Databricks
infrastructure. On the other hand, the installation and handling of
the appropriate Scala and Spark versions is provided automatically. All of the
code within the sample notebooks can also run and scale on
distributed Spark clusters. Within multiple collaborations we identified the need
for high-level RDF data analytics APIs. The partners can solve their use cases
of large-scale Knowledge Graph analytics through the generic modules of
DistRDF2ML. The opportunity to postpone the technical requirements setup until after the
first exploratory work can increase the tryout rate of complex frameworks.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>This work was partly supported by the EU Horizon 2020 project PLATOON
(Grant agreement ID: 872592). We would also like to thank the SANSA
development team for their helpful support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sejdiu</surname>
          </string-name>
          , L. Bühmann, P. Westphal,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          , I. Ermilov,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Jabeen</surname>
          </string-name>
          , “
          <article-title>Distributed semantic analytics using the SANSA stack</article-title>
          ,
          <source>” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          , vol.
          <volume>10588</volume>
          LNCS, no. iii, pp.
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>C.</given-names>
            <surname>Boettiger</surname>
          </string-name>
          , “
          <article-title>An introduction to docker for reproducible research,” ACM SIGOPS Operating Systems Review</article-title>
          , vol.
          <volume>49</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>79</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>T.</given-names>
            <surname>Kluyver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ragan-Kelley</surname>
          </string-name>
          ,
          <string-name>
            <surname>F</surname>
          </string-name>
          . Pérez,
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Granger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bussonnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Frederic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kelley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Hamrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Grout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Corlay</surname>
          </string-name>
          et al.,
          <article-title>Jupyter Notebooks-a publishing format for reproducible computational workflows</article-title>
          .
          <source>Conference: 20th International Conference on Electronic Publishing</source>
          ,
          <year>2016</year>
          , vol.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>I.</given-names>
            <surname>Ermilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sejdiu</surname>
          </string-name>
          , L. Bühmann, P. Westphal,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Petzka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          et al., “
          <article-title>The tale of sansa spark</article-title>
          .” in
          <source>International Semantic Web Conference (Posters, Demos &amp; Industry Tracks)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Databricks-Inc</surname>
          </string-name>
          ., “Databricks platform,” https://databricks.com/product/data-lakehouse,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Consens</surname>
          </string-name>
          , “
          <article-title>Linked movie data base</article-title>
          .”
          <string-name>
            <surname>in</surname>
            <given-names>LDOW</given-names>
          </string-name>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Draschner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Jabeen</surname>
          </string-name>
          , “
          <article-title>Distsim-scalable distributed inmemory semantic similarity estimation for rdf knowledge graphs,” in 2021 IEEE 15th International Conference on Semantic Computing (ICSC)</article-title>
          . IEEE,
          <year>2021</year>
          , pp.
          <fpage>333</fpage>
          -
          <lpage>336</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>SANSA-Team</surname>
            <given-names>,</given-names>
          </string-name>
          “
          <article-title>Sansa-stack - distsim github release and documentation</article-title>
          ,” https://github.com/SANSA-Stack/SANSA-Stack/releases/tag/v0.7.1 DistSim paper,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>T. SANSA</surname>
          </string-name>
          , “
          <article-title>Semantic analytics in the palm of your browser slides</article-title>
          ,” https://github.com/SANSA-Stack/SANSA-Databricks.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>C. F. Draschner</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Stadler</surname>
            ,
            <given-names>F. B.</given-names>
          </string-name>
          <string-name>
            <surname>Moghaddam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Lehmann</surname>
            , and
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Jabeen</surname>
          </string-name>
          , “
          <article-title>DistRDF2ML-Scalable distributed in-memory machine learning pipelines for rdf knowledge graphs” in 2021 ACM International Conference on Information and Knowledge Management (CIKM)</article-title>
          .
          <source>ACM</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>F. B. Moghaddam</surname>
            ,
            <given-names>C. F.</given-names>
          </string-name>
          <string-name>
            <surname>Draschner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Lehmann</surname>
            , and
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Jabeen</surname>
          </string-name>
          , “
          <article-title>Literal2Feature: an automatic scalable rdf graph feature extractor</article-title>
          ”
          <source>in Proceedings of the 17th International Conference on Semantic Systems, SEMANTICS</source>
          <year>2021</year>
          , Amsterdam, The Netherlands,
          <source>September 6-9</source>
          ,
          <year>2021</year>
          . SEMANTICS,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>F. B. Moghaddam</surname>
            ,
            <given-names>C. F.</given-names>
          </string-name>
          <string-name>
            <surname>Draschner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Lehmann</surname>
            , and
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Jabeen</surname>
          </string-name>
          , “
          <source>Semantic Web Analysis with Flavor of Micro-Services” in Big Data Analytics 3rd Summer School</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>