Semantic Web Analysis with a Flavor of Micro-Services

Farshad Bakhshandegan Moghaddam1, Carsten Felix Draschner1, Jens Lehmann1, and Hajira Jabeen2

1 University of Bonn, Bonn, Germany
{firstname.lastname}@uni-bonn.de
2 University of Cologne, Cologne, Germany
hajira.jabeen@uni-koeln.de

Abstract. The last decades have witnessed a significant evolution in data generation, management, and maintenance, especially in the RDF format. In the energy domain in particular, semantic data is finding its way into practice and can be used for various data analytics tasks. However, since data set sizes are increasing and can now be enormous, technologies must evolve to scale with them. In this regard, tools and frameworks such as SANSA have emerged to facilitate analytics over semantic data. SANSA uses big data technologies such as Apache Spark (as an analytics engine for large-scale data processing) and Apache Hadoop (as a distributed file system) in its backbone to perform analytics in a distributed manner over a cluster of nodes. However, to use SANSA, one has to set up a cluster of nodes with Spark and Hadoop enabled. This requires extensive knowledge and expertise in computer systems, networking, distributed computing, and related areas. Even with sufficient technical knowledge, setting up such a cluster is labor-intensive and time-consuming. To tackle these issues, in this paper we introduce a micro-service architecture that brings the power of SANSA and distributed semantic data analysis into the end-user ecosystem without requiring technical knowledge in the mentioned areas. The introduced architecture is based on Docker technologies and can be installed on-premise or in cloud environments.

Keywords: Micro Services, SANSA, Docker, Distributed Computing, RDF Data, Analytics Pipelines

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

With the rapidly growing amount of data available on the Internet, it becomes necessary to have a set of tools that extract meaningful and hidden information from online data. The Semantic Web is able to form a structural view of the existing data on the web and provides machine-readable formats [1]. Currently, many companies in the fields of science, engineering, and energy publish their data in the form of RDF (https://www.w3.org/RDF/). Due to the complex graph nature of RDF data, applying standard machine learning algorithms to this data is cumbersome. Moreover, the limited computational resources of single machines cause analytical approaches to mostly fail on large-scale data. To tackle these issues, the Scalable Semantic Analytics Stack (SANSA) [6] has emerged. SANSA addresses the need for a scalable and distributed computational engine to work with semantic data. It builds on an in-memory analytics framework, Apache Spark (http://spark.apache.org/), and provides fault-tolerant, highly available, and scalable approaches to efficiently process RDF data with the support of semantic technology standards. SANSA provides various layers of functionality for semantic data representation, querying, inference, and analytics (http://sansa-stack.net/). To use SANSA effectively, a cluster of Spark nodes with an HDFS file system is required. Even with such a cluster, one can only interact with SANSA via a terminal.
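To illustrate this pain point, submitting a SANSA job to such a cluster today requires a command along the following lines; the master URL, class name, jar, and HDFS path below are purely illustrative placeholders, not actual SANSA artifacts:

# Submit a Spark job to the cluster; all names below are illustrative.
$ spark-submit \
    --master spark://spark-master:7077 \
    --class net.sansa_stack.examples.MyPipeline \
    my-sansa-pipeline.jar hdfs://namenode:8020/data/input.nt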
Establishing a cluster of nodes running Spark and Hadoop (https://hadoop.apache.org/) and configuring them is by nature a cumbersome task that requires a lot of knowledge and experience. Even with such technical knowledge, establishing the cluster is time-consuming. Furthermore, end-users prefer a more user-friendly way to interact with SANSA that does not require programming or scripting knowledge. Therefore, in this paper, we introduce a micro-service structure for a SANSA-enabled Spark and Hadoop cluster with two user-friendly interactive communication mechanisms: REST APIs and Zeppelin notebooks (https://zeppelin.apache.org/). Our architecture is based on Docker technologies (https://www.docker.com/). Moreover, our explanatory sample tutorial enables non-technical users to use SANSA easily, without any specific knowledge or skills.

1.1 Contributions

- Introducing a micro-service architecture for big data tools such as Apache Spark, Apache Hadoop, Apache Zeppelin, the HDFS file browser, and Apache Livy
- Introducing, for the first time, REST APIs for the SANSA stack
- Introducing an interactive notebook (i.e., Apache Zeppelin) for interacting with SANSA
- Making the code and the framework open-source and publicly available on GitHub (https://github.com/SANSA-Stack/SANSA-Stack)

2 Related Work

In recent years, it has been recognized that creating complex technical environments for big data is a major challenge. Setting up many nodes and connecting them so that they jointly perform a specific task is not only costly but also requires extensive knowledge in several computer science areas, such as networking, cluster computing, and big data technologies. Although technologies such as virtualization (e.g., VirtualBox, Parallels, VMware) and its successor, containerization (e.g., Docker, Swarm, Kubernetes), keep the costs lower, they demand yet other skills to set up a cluster properly.

Besides this, there are numerous centralized machine learning frameworks and algorithms for RDF data. For example, TensorLog [3] and ProPPR [9] are recent frameworks for efficient probabilistic inference in first-order logic. AMIE [5] and AMIE+ [10] learn association rules from RDF data. DL-Learner [2] is a framework for inductive learning on the Semantic Web. [7] provides a review of statistical relational learning techniques for knowledge graphs. In contrast, SANSA leverages Apache Spark and Apache Jena (https://jena.apache.org) to provide an open-source, distributed, scalable, and end-to-end framework for data analytics pipelines over large-scale RDF knowledge graphs [4, 11, 13].

To use SANSA effectively, a distributed environment is inevitable. Some enterprise companies such as Databricks (https://databricks.com/) [12] develop web-based platforms for working with Spark that provide automated cluster management and IPython-style notebooks. However, they are costly, as the user needs to provide an AWS (https://aws.amazon.com/), Microsoft Azure (https://azure.microsoft.com), or Google Cloud (https://cloud.google.com/) account, and the free community edition offers only limited functionality and services. Our approach differs from the mentioned methods as it provides a versatile, flexible, and free-to-use framework via Docker technologies which can be set up on a single machine (e.g., a laptop) or on a cluster of machines in a cloud environment.

3 Architecture

In this section, we present the micro-service system architecture for using SANSA. It is worth mentioning that the framework is open-source and hosted on GitHub (https://github.com/SANSA-Stack/SANSA-Stack). The main goal of the framework is to provide a simple and effortless way to set up a Spark cluster with all its requirements, without demanding extensive computer science knowledge.

Fig. 1. High-level system overview

3.1 Components

We provide two interaction mechanisms for end-users: a) Zeppelin notebooks and b) REST APIs. Depending on the scenario, users may select either of these mechanisms; together they cover the full spectrum from simplicity to flexibility. Using the REST APIs, users can call predefined SANSA functionality without any effort. If a user needs new functionality, however, they can write their own code in a Zeppelin notebook and submit the task to the Spark cluster. Figure 1 depicts the high-level system overview. The architecture contains four main components, i.e., a) Java-based REST APIs, b) Apache Zeppelin Notebook, c) Apache Livy, and d) a Spark-Hadoop cluster, which are all orchestrated via Docker-Compose.

Java-based REST API. This layer provides functionality for the end-user to interact with SANSA via REST APIs. It is Java-based, powered by Spring Boot (https://spring.io/projects/spring-boot), and contains a Swagger (https://swagger.io/) UI which enables users to easily call any provided function via a browser. Figure 2 shows the Swagger UI and the provided APIs. A sample scenario of how to use these APIs is provided in Section 4.

Fig. 2. REST Swagger UI

Apache Livy REST APIs. Spark tasks may be long-running (up to a few days), and there is a chance that the node running a task crashes and loses the computation; directly connecting the REST APIs to the Spark cluster is therefore not feasible, due to the asynchronous nature of such computations. To tackle this, another layer has been added using Apache Livy (https://livy.apache.org/), which keeps track of Spark sessions and computation states. Livy enables programmatic, fault-tolerant, multi-tenant submission of Spark jobs from web/mobile apps, so multiple users can interact with the Spark cluster concurrently and reliably. Although the Livy interface is available to the user, this layer works as a background process; the user does not need to interact with it directly, because all functionality is handled by the REST API layer. A minimal sketch of such a submission is shown below.
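For illustration, the following sketch submits a batch job through Livy's documented REST API (POST /batches, GET /batches/{id}/state); the jar path and class name are placeholders, and port 8998 corresponds to the Livy endpoint listed in Table 1:

# Submit a Spark batch job through Livy (jar and class are placeholders).
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"file": "hdfs:///jars/sansa-job.jar", "className": "my.SansaJob"}' \
    http://localhost:8998/batches
# Poll the state of the created batch (here: batch id 0); the JSON
# response contains a state such as "running" or "success".
$ curl http://localhost:8998/batches/0/state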
Spark-Hadoop Cluster. To run SANSA functionality, a Spark cluster with a Hadoop file system is inevitable. To this end, we containerized Spark, the Hadoop Namenode, the Hadoop Datanode, and the Hue HDFS file browser (https://gethue.com/) in Docker images which are publicly available. Moreover, we configured the containers to interact with each other seamlessly via Docker-Compose. All the other layers have been containerized and exposed in the same docker-compose file as well, so the whole cluster can be managed as one unit, as sketched below.
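Because all services live in one docker-compose file, the cluster can also be inspected with standard Docker Compose commands; the service name used below is an assumed example, not necessarily the name used in our compose file:

$ docker-compose up -d              # start all containers in the background
$ docker-compose ps                 # list the running services
$ docker-compose logs spark-master  # print one service's logs (assumed name)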
4 Usage

To run the cluster, the user needs to clone the SANSA Stack from GitHub and navigate to the sansa-rest sub-folder. The following commands in the terminal bring up the cluster.

$ git clone https://github.com/SANSA-Stack/SANSA-Stack.git
$ cd SANSA-Stack/sansa-rest
$ make
$ make up

To stop the cluster, the user only needs to run the following command.

$ make down

Table 1 lists all the available endpoints and their functionalities.

Table 1. Available endpoints and their functionalities

endpoint                      functionality
http://localhost              Zeppelin Notebook
http://localhost:8085         REST Swagger UI
http://localhost:8998         Livy UI
http://localhost:8080         Spark Master UI
http://localhost:8088/home    Hue file browser UI

As already mentioned, users have two mechanisms for interacting with SANSA: Zeppelin notebooks or the REST APIs. Using a Zeppelin notebook is easy and straightforward, just like other notebook technologies such as Jupyter Notebook (https://jupyter.org/); therefore, without loss of generality and due to space constraints, we omit its explanation in this paper. In the following, we instead explain a sample scenario which shows how to use the REST APIs. Besides its many functionalities, SANSA provides a distributed SPARQL engine (i.e., Sparklify [8]) which is able to execute a SPARQL query in a distributed fashion. As a scenario, imagine the user has an RDF file (in any format) and would like to run a SPARQL query over it. To use a REST API for this purpose, the user first needs to upload the file into HDFS, because SANSA needs to access the file in a distributed manner. For this reason, we have provided an API in Swagger which enables the user to upload files to HDFS. The result of this API call is the address at which the file is stored in HDFS; the user will need to provide this address in any subsequent API call. To call the SPARQL engine API (i.e., /api/sparql), the user simply needs to provide the SPARQL query and the file address retrieved from the upload API. Note that the result of any API call is a Livy batch id. This id is a unique number identifying the Livy session responsible for the task computation. We provide two APIs which receive this id and report on the execution process and its result. The result of the /api/getState API is either running, success, or dead. Only in the case of success can the user call the /api/getResult API to see the result of the call. The other two states indicate that the process is either still ongoing or has stopped unexpectedly. A sketch of the full workflow follows.
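The scenario above can be sketched with curl as follows; the upload endpoint name and the parameter names are illustrative assumptions (the exact paths and parameters are listed in the Swagger UI at http://localhost:8085), while /api/sparql, /api/getState, and /api/getResult are the endpoints described above:

# Upload an RDF file to HDFS (endpoint and form field are assumed names);
# the response contains the HDFS address of the stored file.
$ curl -X POST -F "file=@data.nt" http://localhost:8085/api/upload
# Run a SPARQL query over the uploaded file; returns a Livy batch id.
$ curl -X POST http://localhost:8085/api/sparql \
    --data-urlencode "path=<hdfs-address-from-upload>" \
    --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 10"
# Poll the state (running / success / dead), then fetch the result.
$ curl "http://localhost:8085/api/getState?id=<batch-id>"
$ curl "http://localhost:8085/api/getResult?id=<batch-id>"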
5 Conclusion

In this paper, we introduced a framework which sets up a big-data-enabled cluster with all its requirements, from Spark and Hadoop to Zeppelin notebooks, so that the SANSA framework can be used without any effort. The proposed approach is based on Docker Compose technology and can be installed on-premise or in any cloud environment. Moreover, we implemented a set of REST APIs, exposed via Swagger, which enables non-technical users to interact with SANSA in a very straightforward manner.

Acknowledgement

This work was partly supported by the EU Horizon 2020 project PLATOON (Grant agreement ID: 872592). We would also like to thank the SANSA development team for their helpful support.

References

1. Berners-Lee, T. A Roadmap to the Semantic Web. http://www.w3.org/DesignIssues/Semantic.html (1998)
2. Bühmann, L., Lehmann, J. & Westphal, P. DL-Learner - A Framework for Inductive Learning on the Semantic Web. J. Web Semant. 39 pp. 15-24 (2016)
3. Cohen, W. TensorLog: A Differentiable Deductive Database. CoRR. abs/1605.06523 (2016)
4. Draschner, C., Lehmann, J. & Jabeen, H. DistSim - Scalable Distributed in-Memory Semantic Similarity Estimation for RDF Knowledge Graphs. 2021 IEEE 15th International Conference on Semantic Computing (ICSC). pp. 333-336 (2021)
5. Galárraga, L., Teflioudi, C., Hose, K. & Suchanek, F. Fast Rule Mining in Ontological Knowledge Bases with AMIE+. VLDB J. 24, 707-730 (2015)
6. Lehmann, J., Sejdiu, G., Bühmann, L., Westphal, P., Stadler, C., Ermilov, I., Bin, S., Chakraborty, N., Saleem, M., Ngonga, A. & Jabeen, H. Distributed Semantic Analytics using the SANSA Stack. Proceedings of the 16th International Semantic Web Conference - Resources Track (ISWC 2017). pp. 147-155 (2017)
7. Nickel, M., Murphy, K., Tresp, V. & Gabrilovich, E. A Review of Relational Machine Learning for Knowledge Graphs. Proceedings of the IEEE. 104, 11-33 (2016), https://doi.org/10.1109/JPROC.2015.2483592
8. Stadler, C., Sejdiu, G., Graux, D. & Lehmann, J. Sparklify: A Scalable Software Component for Efficient Evaluation of SPARQL Queries over Distributed RDF Datasets. The Semantic Web - ISWC 2019 - 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part II. 11779 pp. 293-308 (2019), https://doi.org/10.1007/978-3-030-30796-7
9. Wang, W., Mazaitis, K. & Cohen, W. Structure Learning via Parameter Learning. CIKM. pp. 1199-1208 (2014)
10. Galárraga, L., Teflioudi, C., Hose, K. & Suchanek, F. Fast Rule Mining in Ontological Knowledge Bases with AMIE+. VLDB J. 24, 707-730 (2015)
11. Draschner, C.F., Stadler, C., Moghaddam, F.B., Lehmann, J. & Jabeen, H. DistRDF2ML - Scalable Distributed In-Memory Machine Learning Pipelines for RDF Knowledge Graphs. 2021 ACM International Conference on Information and Knowledge Management (CIKM) (2021)
12. Draschner, C.F., Moghaddam, F.B., Lehmann, J. & Jabeen, H. Semantic Analytics in the Palm of Your Browser. Big Data Analytics 3rd Summer School (2021)
13. Moghaddam, F.B., Draschner, C.F., Lehmann, J. & Jabeen, H. Literal2Feature: An Automatic Scalable RDF Graph Feature Extractor. Proceedings of the 17th International Conference on Semantic Systems (SEMANTICS 2021), Amsterdam, The Netherlands, September 6-9, 2021 (2021)