=Paper=
{{Paper
|id=Vol-3195/paper1
|storemode=property
|title=Semantic Web Analysis with Flavor of Micro-Services
|pdfUrl=https://ceur-ws.org/Vol-3195/paper1.pdf
|volume=Vol-3195
|authors=Farshad Bakhshandegan Moghaddam,Carsten Felix Draschner,Jens Lehmann,Hajira Jabeen
|dblpUrl=https://dblp.org/rec/conf/lambda-net/MoghaddamD0J21
}}
==Semantic Web Analysis with Flavor of Micro-Services==
Farshad Bakhshandegan Moghaddam¹, Carsten Felix Draschner¹, Jens Lehmann¹, and Hajira Jabeen²
¹ University of Bonn, Bonn, Germany, {firstname.lastname}@uni-bonn.de
² University of Cologne, Cologne, Germany, hajira.jabeen@uni-koeln.de
Abstract. The last decades have witnessed a significant evolution in data generation, management, and maintenance, especially in the RDF format. Moreover, in the energy domain, semantic data is finding its way into practice and can be used for various data analytics tasks. However, since data set sizes are increasing and can now be enormous, technologies are evolving to scale with them. In this regard, tools and frameworks such as SANSA have emerged to facilitate analytics over semantic data. SANSA uses big data technologies such as Apache Spark (as an analytics engine for large-scale data processing) and Apache Hadoop (as a distributed file system) in its backbone to perform analytics in a distributed manner over a cluster of nodes. However, to be able to use SANSA, one must set up a cluster of nodes with Spark and Hadoop enabled. This requires extensive knowledge and expertise in computer systems, networking, distributed computing, and related areas. Moreover, even with sufficient technical knowledge, setting up such a cluster is labor-intensive and consumes considerable manpower. To tackle these issues, in this paper we introduce a micro-service architecture that easily brings the power of SANSA and distributed semantic data analysis into the end-user's ecosystem, without requiring technical knowledge in the mentioned areas. The introduced architecture is based on Docker technologies and can be installed on-premises or in cloud systems.
Keywords: Micro Services, SANSA, Docker, Distributed Computing, RDF Data, Analytics Pipelines

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 Introduction
With the rapidly growing amount of data available on the Internet, it becomes necessary to have a set of tools to extract meaningful and hidden information from the online data. The Semantic Web is able to form a structural view of the existing data on the web and provides machine-readable formats [1]. Currently, many companies in the fields of science, engineering, and energy publish their data in the form of RDF (https://www.w3.org/RDF/). Due to the complex graph nature of RDF data, applying standard machine learning algorithms to this data is cumbersome. Moreover, the challenges of the current big data era (limited computational resources) cause analytical approaches to mostly fail on large-scale data. To tackle these issues, the Scalable Semantic Analytics Stack (SANSA) [6] has emerged. SANSA addresses the need for a scalable and distributed computational engine for working with semantic data. It builds on the in-memory analytics framework Apache Spark (http://spark.apache.org/) and provides fault-tolerant, highly available, and scalable approaches to efficiently process RDF data with the support of semantic technology standards. SANSA provides various layers of functionality for semantic data representation, querying, inference, and analytics (http://sansa-stack.net/).
To use SANSA effectively, a cluster of Spark nodes with an HDFS file system is required. Even with such a cluster, one can only interact with SANSA via a terminal. Establishing a cluster of nodes running Spark and Hadoop (https://hadoop.apache.org/) and configuring them is by nature a cumbersome task and requires considerable knowledge and experience. Even with that technical knowledge, establishing such a cluster is time-consuming. Moreover, end-users prefer a more user-friendly way to interact with SANSA, one that does not require any programming or scripting knowledge. Therefore, in this paper, we introduce a micro-service structure for a SANSA-enabled Spark and Hadoop cluster with two user-friendly interactive communication mechanisms, namely a REST API and Zeppelin notebooks (https://zeppelin.apache.org/). Our architecture is based on Docker technologies (https://www.docker.com/). Moreover, our sample explanatory tutorial enables non-technical users to use SANSA easily, without any specific knowledge or skills.
1.1 Contributions
- Introducing a micro-service architecture of big data tools such as Apache Spark, Apache Hadoop, Apache Zeppelin, the HDFS file browser, and Apache Livy
- Introducing, for the first time, REST APIs for the SANSA stack
- Introducing an interactive notebook (i.e. Apache Zeppelin) for interacting with SANSA
- Making the code and the framework open-source and publicly available on GitHub (https://github.com/SANSA-Stack/SANSA-Stack)
2 Related Work
In recent years, it has been recognized that creating complex technical environments for big data is a major challenge. Setting up many nodes and connecting them so that they jointly perform a specific task is not only costly but also requires extensive knowledge in many areas of computer science, such as networking, cluster computing, and big data technologies. Although technologies such as virtualization (e.g. VirtualBox, Parallels, VMware) and its successor, containerization (e.g. Docker, Swarm, Kubernetes), keep the costs lower, they add further skills one must master to set up a cluster properly.
Besides this, there are numerous centralized machine learning frameworks and algorithms for RDF data. For example, TensorLog [3] and ProPPR [9] are recent frameworks for efficient probabilistic inference in first-order logic. AMIE [5] and AMIE+ [10] learn association rules from RDF data. DL-Learner [2] is a framework for inductive learning for the Semantic Web, and [7] provides a review of statistical relational learning techniques for knowledge graphs. In contrast, SANSA leverages Apache Spark and Apache Jena (https://jena.apache.org) to provide an open-source, distributed, scalable, and end-to-end framework for data analytics pipelines over large-scale RDF knowledge graphs [4, 11, 13].
In order to use SANSA effectively, having a distributed environment is inevitable. Some enterprise companies, such as Databricks (https://databricks.com/) [12], develop web-based platforms for working with Spark that provide automated cluster management and IPython-style notebooks. However, these are costly, as the user needs an AWS (https://aws.amazon.com/), Microsoft Azure (https://azure.microsoft.com), or Google Cloud (https://cloud.google.com/) account, and the free community edition provides only limited functionality and services. Our approach differs from the mentioned methods as it provides a versatile, flexible, and free-to-use framework via Docker technologies, which can be set up on a single machine (e.g. a laptop) or on a cluster of machines in a cloud environment.
3 Architecture
In this section, we present the micro-service system architecture for using SANSA. It is worth mentioning that the framework is open-source and hosted on GitHub (https://github.com/SANSA-Stack/SANSA-Stack). The main goal of the framework is to provide a simple and effortless way to set up a Spark cluster with all the requirements, without demanding extensive computer science knowledge.
Fig. 1. High-level system overview
3.1 Components
We provide two interaction mechanisms for end-users: a) Zeppelin notebooks and b) REST APIs. Depending on the scenario, users may select one of these mechanisms; together they cover the full spectrum from simplicity to flexibility. Using the REST APIs, users can call predefined functionality from SANSA without any effort. However, if a user is interested in new functionality, they can write their own code in a Zeppelin notebook and submit the task to the Spark cluster. Figure 1 depicts the high-level system overview. The architecture contains four main components, i.e. a) Java-based REST APIs, b) the Apache Zeppelin notebook, c) Apache Livy, and d) a Spark-Hadoop cluster, which are all orchestrated via Docker-Compose.
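As a quick sanity check once the stack is up (the startup commands are given in Section 4), each component should answer on its published port. The ports below are the ones listed in Table 1; the probe itself is only an illustrative sketch, not part of the framework.

$ # Probe each service and print only the returned HTTP status code
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8085       # REST Swagger UI
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8998       # Livy UI
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080       # Spark Master UI
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088/home  # Hue file browser UI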
Java-based REST API
This layer provides functionality for the end-user to interact with SANSA via REST APIs. It is Java-based, powered by Spring Boot (https://spring.io/projects/spring-boot) technology, and contains a Swagger (https://swagger.io/) UI, which enables users to easily call any provided function via a browser. Figure 2 shows the Swagger UI and the provided APIs. A sample scenario of how to use these APIs is given in Section 4.
Fig. 2. REST Swagger UI
Apache Livy REST APIs
Spark tasks may be long-running (up to a few days), and there is a chance that the node running a task crashes and the calculations are lost. Directly connecting the REST APIs to the Spark cluster is therefore not feasible, due to the asynchronous nature of such computations. To tackle this, another layer has been added using Apache Livy (https://livy.apache.org/), which keeps track of Spark sessions and calculation states. Livy enables programmatic, fault-tolerant, multi-tenant submission of Spark jobs from web and mobile apps, so multiple users can interact with the Spark cluster concurrently and reliably. Although the Livy interface is available to the user, this layer works as a background process, and the user does not need to interact with it directly, because all the functionality is handled by the REST API layer.
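For illustration, the sketch below talks to Livy's generic batch API directly; POST /batches and GET /batches/{id}/state are standard Livy endpoints, while the jar path and class name are hypothetical placeholders. In normal operation, the REST API layer issues such calls on the user's behalf.

$ # Submit a Spark job as a Livy batch (jar path and class name are placeholders)
$ curl -s -X POST -H "Content-Type: application/json" \
      -d '{"file": "hdfs:///jars/sansa-job.jar", "className": "net.sansa_stack.examples.SomeJob"}' \
      http://localhost:8998/batches
$ # Poll the state of batch 0 until it reports success or dead
$ curl -s http://localhost:8998/batches/0/state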
Spark-Hadoop Cluster
To run SANSA functionality, having a Spark cluster with a Hadoop file system is inevitable. To this end, we containerized Spark, the Hadoop namenode, the Hadoop datanode, and the Hue HDFS file browser (https://gethue.com/) in Docker images, which are publicly available. Moreover, we configured the containers to interact with each other seamlessly via Docker-Compose. All the other layers are also containerized and exposed in the same docker-compose file.
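As a minimal sketch of how the containerized HDFS can be reached from the host, assuming the namenode service is called namenode in the docker-compose file (a hypothetical name; check the compose file for the actual one):

$ # List the HDFS root from inside the namenode container
$ docker-compose exec namenode hdfs dfs -ls /
$ # Manually copy a local file (path is a placeholder) into HDFS; the REST
$ # upload API described in Section 4 performs this step for the user
$ docker-compose exec namenode hdfs dfs -put /data/sample.nt /user/sansa/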
4 Usage
To run the cluster, the user needs to clone the SANSA Stack from GitHub and navigate to the sansa-rest sub-folder. The following commands in the terminal bring up the cluster.
$ git clone https://github.com/SANSA-Stack/SANSA-Stack.git
$ cd SANSA-Stack/sansa-rest
$ make
$ make up
To stop the cluster, the user only needs to run the following command.
$ make down
Table 1 lists all the available endpoints and their functionalities.
Table 1. Available endpoints and their functionalities

Endpoint                    Functionality
http://localhost            Zeppelin Notebook
http://localhost:8085       REST Swagger UI
http://localhost:8998       Livy UI
http://localhost:8080       Spark Master UI
http://localhost:8088/home  Hue file browser UI
As already mentioned, users have two interaction mechanisms to connect to SANSA: either the Zeppelin notebook or the REST APIs. Using the Zeppelin notebook is easy and straightforward, just like other notebook technologies such as Jupyter Notebook (https://jupyter.org/); therefore, without loss of generality and due to space constraints, we omit its explanation in this paper. In the following, we instead explain a sample scenario that shows how to use the REST APIs. Among its many functionalities, SANSA provides a distributed SPARQL engine (i.e. Sparklify [8]) that is able to execute a SPARQL query in a distributed fashion. As a scenario, imagine the user has an RDF file (in any format) and would like to run a SPARQL query over it. To use a REST API for this purpose, the user first needs to upload the file into HDFS, because SANSA needs to access the file in a distributed manner. For this reason, we have provided an API in Swagger that enables the user to upload files to HDFS. The result of this API call is the address at which the file is stored in HDFS; the user needs to provide this address in any subsequent API call. To call the SPARQL engine API (i.e. /api/sparql), the user simply needs to provide the SPARQL query and the file address retrieved from the file upload API. Keep in mind that the result of any API call is a Livy batch id. This id is a unique number that identifies the Livy session responsible for the task computation. We provide two APIs that receive this id and return more information about the execution process and the result of the execution. The result of the /api/getState API is either running, success, or dead. Only in the case of success can the user call the /api/getResult API to see the result of the call. The other two states show either that the process is ongoing or that it has stopped unexpectedly.
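A minimal sketch of this flow with curl is shown below, assuming the REST service listens on the Swagger UI port 8085 from Table 1. The /api/sparql, /api/getState, and /api/getResult endpoints are the ones named above; the upload endpoint name, the parameter names, and the example values are illustrative assumptions, so the Swagger UI should be consulted for the exact signatures.

$ # 1. Upload a local RDF file to HDFS (endpoint and form field are assumed names);
$ #    the response contains the HDFS address of the stored file
$ curl -s -X POST -F "file=@mydata.nt" http://localhost:8085/api/upload
$ # 2. Run a SPARQL query over the uploaded file (parameter names are assumed);
$ #    the response contains a Livy batch id, e.g. 7
$ curl -s -X POST http://localhost:8085/api/sparql \
      --data-urlencode "path=<HDFS address from step 1>" \
      --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 10"
$ # 3. Poll the state until it is success (or dead)
$ curl -s "http://localhost:8085/api/getState?id=7"
$ # 4. Fetch the query result once the state is success
$ curl -s "http://localhost:8085/api/getResult?id=7"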
5 Conclusion
In this paper, we introduced a framework that sets up a Big Data-enabled cluster with all its requirements, from Spark and Hadoop to the Zeppelin notebook, so that the SANSA framework can be used without any effort. The proposed approach is based on Docker Compose technology and can be installed on-premises or in any cloud environment. Moreover, we implemented a set of REST APIs, documented via Swagger, which enables non-technical users to interact with SANSA in a very straightforward manner.
Acknowledgement
This work was partly supported by the EU Horizon 2020 project PLATOON (Grant agreement ID: 872592). We would also like to thank the SANSA development team for their helpful support.
References
1. Berners-Lee, T. A roadmap to the Semantic Web. http://www.w3.org/DesignIssues/Semantic.html (1998)
2. Bühmann, L., Lehmann, J. & Westphal, P. DL-Learner - A framework for inductive learning on the Semantic Web. J. Web Semant. 39 pp. 15-24 (2016)
3. Cohen, W. TensorLog: A Differentiable Deductive Database. CoRR. abs/1605.06523 (2016)
4. Draschner, C., Lehmann, J. & Jabeen, H. DistSim - Scalable Distributed in-Memory Semantic Similarity Estimation for RDF Knowledge Graphs. 2021 IEEE 15th International Conference On Semantic Computing (ICSC). pp. 333-336 (2021)
5. Galárraga, L., Teflioudi, C., Hose, K. & Suchanek, F. Fast rule mining in ontological knowledge bases with AMIE+. VLDB J. 24, 707-730 (2015)
6. Lehmann, J., Sejdiu, G., Bühmann, L., Westphal, P., Stadler, C., Ermilov, I., Bin, S., Chakraborty, N., Saleem, M., Ngonga, A. & Jabeen, H. Distributed Semantic Analytics using the SANSA Stack. Proceedings Of 16th International Semantic Web Conference - Resources Track (ISWC'2017). pp. 147-155 (2017)
7. Nickel, M., Murphy, K., Tresp, V. & Gabrilovich, E. A Review of Relational Machine Learning for Knowledge Graphs. Proceedings Of The IEEE. 104, 11-33 (2016), https://doi.org/10.1109/JPROC.2015.2483592
8. Stadler, C., Sejdiu, G., Graux, D. & Lehmann, J. Sparklify: A Scalable Software Component for Efficient Evaluation of SPARQL Queries over Distributed RDF Datasets. The Semantic Web - ISWC 2019 - 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part II. 11779 pp. 293-308 (2019), https://doi.org/10.1007/978-3-030-30796-7
9. Wang, W., Mazaitis, K. & Cohen, W. Structure Learning via Parameter Learning. CIKM. pp. 1199-1208 (2014)
10. Galárraga, L., Teflioudi, C., Hose, K. & Suchanek, F. Fast rule mining in ontological knowledge bases with AMIE+. VLDB J. 24, 707-730 (2015)
11. Draschner, C. F., Stadler, C., Moghaddam, F. B., Lehmann, J. & Jabeen, H. DistRDF2ML - Scalable Distributed in-Memory Machine Learning Pipelines for RDF Knowledge Graphs. 2021 ACM International Conference On Information And Knowledge Management (CIKM) (2021)
12. Draschner, C. F., Moghaddam, F. B., Lehmann, J. & Jabeen, H. Semantic Analytics in the Palm of Your Browser. Big Data Analytics 3rd Summer School (2021)
13. Moghaddam, F. B., Draschner, C. F., Lehmann, J. & Jabeen, H. Literal2Feature: An Automatic Scalable RDF Graph Feature Extractor. Proceedings Of The 17th International Conference On Semantic Systems (SEMANTICS 2021), Amsterdam, The Netherlands, September 6-9, 2021 (2021)