The Tale of Sansa Spark

Ivan Ermilov³, Jens Lehmann¹,², Gezim Sejdiu¹, Lorenz Bühmann³, Patrick Westphal³, Claus Stadler³, Simon Bin³, Nilesh Chakraborty¹, Henning Petzka², Muhammad Saleem³,⁴, Axel-Cyrille Ngonga Ngomo³,⁴, and Hajira Jabeen¹

¹ University of Bonn, Germany
{jens.lehmann, sejdiu, chakrabo, jabeen}@cs.uni-bonn.de
² Fraunhofer IAIS, Bonn, Germany
{jens.lehmann, henning.petzka}@iais.fraunhofer.de
³ Institute for Applied Informatics (InfAI), University of Leipzig, Germany
{buehmann,patrick.westphal,cstadler,iermilov,sbin,saleem}@informatik.uni-leipzig.de
⁴ Paderborn University, Data Science Group, Paderborn, Germany
axel.ngonga@uni-paderborn.de

Abstract. We demonstrate the open-source Semantic Analytics Stack (SANSA), which can perform scalable analysis of large-scale knowledge graphs to facilitate applications such as link prediction, knowledge base completion and reasoning. The motivation behind this work lies in the lack of scalable methods for analytics which exploit expressive structures underlying semantically structured knowledge bases. The demonstration is based on the BigDataEurope technical platform, which utilizes Docker technology. We present various examples of using SANSA in the form of interactive Spark notebooks, which are executed with Apache Zeppelin. The technical platform and the notebooks are available on the SANSA GitHub and can be deployed on any Docker-enabled host, locally or in a Docker Swarm cluster.

1 Introduction

SANSA⁵ is an open-source⁶ structured data processing engine for performing distributed computation over large-scale RDF datasets [1]. It provides data distribution, scalability and fault tolerance for (1) manipulating large RDF datasets, and (2) applying analytics on the data at scale by making use of cluster-based big data processing engines. In this demonstration paper, we describe a web-based prototype for interacting with SANSA.⁷

SANSA comes with: (i) specialised serialisation mechanisms and partitioning schemata for RDF, using vertical partitioning strategies, (ii) a scalable query engine for large RDF datasets and different distributed representation formats for RDF, (iii) an adaptive reasoning engine which derives an efficient execution and evaluation plan from a given set of inference rules, (iv) several distributed structured machine learning algorithms that can be applied on large-scale RDF data, and (v) a framework with a unified API that aims to combine distributed in-memory computation technology with semantic technologies.

To achieve the goal of storing and manipulating large RDF datasets, SANSA leverages existing big data frameworks like Apache Spark and Apache Flink,⁸ which have matured over the years and offer a reliable method for general-purpose processing of large-scale data.

In this demonstration, we will present and describe our implementation of interactive Spark Notebooks for SANSA.⁹ These notebooks are a collaborative environment implemented as an interactive web editor. They allow access to the SANSA layers and hence provide data scientists, data engineers and students with means to easily use and execute the functionality of SANSA to explore, analyze and learn from large-scale RDF datasets.

⁵ http://sansa-stack.net/
⁶ https://github.com/SANSA-Stack
⁷ Please note that any similarities of the paper title to popular TV series are purely coincidental.

2 The SANSA Stack

Research efforts in the areas of distributed analytics and semantic technologies have been mostly isolated until now. We aim to proceed one step further by using the semantic modelling standard as a basis for machine learning and data analytics. The layered architecture of SANSA is a direct consequence of this integrated vision and is depicted at the top of Figure 1. For a detailed description of each layer, we refer to [1].
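To make the vertical partitioning mentioned among SANSA's features in Section 1 more concrete, the following minimal sketch shows the idea in its textbook form: triples are grouped by predicate, yielding one two-column (subject, object) table per predicate. This is plain Python for illustration only and does not use or reproduce SANSA's API; the function name `vertical_partition` and the sample triples are hypothetical.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Group (subject, predicate, object) triples into one table per predicate.

    Each table is a list of (subject, object) pairs. Hypothetical helper,
    not part of SANSA.
    """
    tables = defaultdict(list)
    for subject, predicate, obj in triples:
        tables[predicate].append((subject, obj))
    return dict(tables)


# Hypothetical sample data using prefixed names as plain strings.
triples = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:knows", "ex:bob"),
]

tables = vertical_partition(triples)
for predicate, rows in tables.items():
    print(predicate, rows)
```

A query that touches only one predicate then needs to scan only that predicate's table, which is the usual motivation for this layout.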
[Figure 1: architecture diagram. Top: the Scalable Semantic Analytics Stack (SANSA) with the layers Machine Learning, Inference, Querying, Knowledge Distribution & Representation, and Distributed In-Memory Processing. Bottom: the Machine Learning Stack (left) and the Semantic Technology stack (right), contrasted on data integration, input formats and modelling, standardisation of data formats, measurability of benefits, and horizontal scalability.]

Fig. 1. The SANSA framework combines distributed analytics (left) and semantic technologies (right) into a scalable semantic analytics stack (top). The colours encode how the original stacks influence the SANSA stack. A central idea of SANSA is that the characteristics of each technology stack (bottom) can be combined while retaining their respective advantages (figure taken from [1]).

⁸ http://spark.apache.org and http://flink.apache.org
⁹ https://github.com/SANSA-Stack/SANSA-Notebooks

[Figure 2: diagram of Docker containers (Hue GUI file browser, Hadoop, Spark, Apache Zeppelin) with SANSA notebooks for RDF, Query, Inference, OWL and ML; arrows labelled upload, download, exchange data and Spark submit.]

Fig. 2. SANSA-Notebooks architecture.

3 SANSA Notebooks

SANSA provides Notebooks for easy local deployment for development and demonstration purposes. SANSA-Notebooks is an interactive toolkit on top of Hadoop-Spark-Workbench¹⁰ with Apache Zeppelin,¹¹ which allows copying files to and from HDFS and executing Spark code interactively via a web GUI. The architecture of SANSA-Notebooks is depicted in Figure 2. The authors use SANSA-Notebooks (see Figure 3) in Big Data labs and courses, as they remove the burden of a complicated Hadoop/Spark setup and allow the students to focus on developing distributed algorithms on top of SANSA.
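As an illustration of the kind of distributed algorithm a student might prototype in such a notebook, the sketch below counts how often each property occurs across several partitions of triples, in a map/reduce style. It is plain Python, not SANSA or Spark code: on a cluster the two steps would roughly correspond to a per-partition computation followed by a global aggregation. All function names and data are hypothetical.

```python
from collections import Counter
from functools import reduce


def count_properties(partition):
    """Map step: count predicate occurrences within a single partition."""
    return Counter(predicate for _, predicate, _ in partition)


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def property_distribution(partitions):
    """Count predicate usage over all partitions (hypothetical helper)."""
    partial_counts = map(count_properties, partitions)
    return reduce(merge_counts, partial_counts, Counter())


# Hypothetical data: two partitions of (subject, predicate, object) triples.
partitions = [
    [("ex:alice", "rdf:type", "ex:Person"), ("ex:alice", "ex:knows", "ex:bob")],
    [("ex:bob", "rdf:type", "ex:Person")],
]

distribution = property_distribution(partitions)
print(distribution.most_common())
```

Because the merge step is associative and commutative, the partial counts can be combined in any order, which is what makes this pattern suitable for parallel execution.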
Cluster deployment of the examples is also possible through Docker images (see the SANSA-Examples GitHub repository¹²). Additionally, SANSA is readily available from the Maven Central Repository. It is therefore straightforward to include it in other projects using Maven or SBT (the most popular build tools for Scala) for both Spark- and Flink-based setups.

During the demonstration, we will present the example notebooks.¹³ These examples give a quick overview of the SANSA APIs. SANSA is built on the concepts of distributed datasets (i.e., RDD, DataFrame, DataSet). A dataset is first derived from external data; parallel operations (e.g., transformations and actions) are then applied, which trigger job execution on a cluster. Depending on the network connection, the demonstration will be performed on a local single-node cluster or a remote multi-node cluster. In the following, we provide a concise description of the examples, grouped by the SANSA layers.

1. RDF.
   (a) Reading and writing triple files from HDFS or the file system, and some basic triple operations.
   (b) A distributed evaluation of numerous RDF dataset statistics, dubbed RDF-Stats (see Figure 3), for example property distribution, class distribution, distinct subjects/objects/entities, as well as a statistics summary.
   (c) Assigning weights to a given entity based on the Spark GraphX PageRank algorithm after triples have been transformed to a graph representation (i.e., PageRank for resources).
2. Query. The example applies Sparqlify,¹⁴ which is a SPARQL-to-SQL rewriter, for data partitioning and schema extraction. The queries are executed using the SparkSQL engine.
3. RDF inference. The examples apply a reasoning profile (RDFS Full, RDFS Simple, OWL Horst, Transitive) to a given input file with an optimised execution plan.
4. OWL. The examples provided for the OWL layer demonstrate the process of loading an OWL file into a Spark RDD, a Spark Dataset, or a Flink DataSet.
5. Machine Learning.
   (a) Clustering algorithms. Three examples for different clustering algorithms are provided, namely power iteration clustering, BorderFlow and modularity clustering. They all take an RDF graph as input and return the list of triples for each of the different clusters.
   (b) Rule mining. This example applies association rule mining to a given RDF knowledge base. The output is the set of closed Horn rules that satisfy a support-confidence threshold.

Fig. 3. RDF-Stats Spark application running in SANSA-Notebooks with statistics visualization.

A powerful feature of SANSA Notebooks is that the result set of a previous session can be viewed within the Spark framework. If you have found an insight in your data and would like to share it, you can easily create a report and either print or send it.

¹⁰ https://github.com/big-data-europe/docker-hadoop-spark-workbench
¹¹ https://zeppelin.apache.org/
¹² https://github.com/SANSA-Stack/SANSA-Examples
¹³ The source code for all of them is provided at https://github.com/SANSA-Stack/SANSA-Examples. We will present a tale / storyline using different examples across the SANSA layers for booth visitors and adapt them interactively (with new parameters, other datasets etc.) in the web browser.
¹⁴ http://aksw.org/Projects/Sparqlify.html

References

1. J. Lehmann, G. Sejdiu, L. Bühmann, P. Westphal, C. Stadler, I. Ermilov, S. Bin, N. Chakraborty, M. Saleem, A.-C. Ngonga Ngomo, and H. Jabeen. Distributed Semantic Analytics using the SANSA Stack. In Proceedings of the 16th International Semantic Web Conference - Resources Track (ISWC'2017), 2017.