STATisfy Me: What are my Stats?

Gezim Sejdiu¹, Ivan Ermilov², Mohamed Nadjib Mami¹,³ and Jens Lehmann¹,³

¹ Smart Data Analytics, University of Bonn, Germany
sejdiu@cs.uni-bonn.de, mami@cs.uni-bonn.de, jens.lehmann@cs.uni-bonn.de
² Department of Computer Science, University of Leipzig, 04109 Leipzig, Germany
iermilov@informatik.uni-leipzig.de
³ Fraunhofer IAIS, Germany
mohamed.nadjib.mami@iais.fraunhofer.de, jens.lehmann@iais.fraunhofer.de

Abstract. The increasing adoption of the Linked Data format, RDF, over the last two decades has brought new opportunities. It has also raised new challenges, especially when it comes to managing and processing large amounts of RDF data. In particular, assessing the internal structure of a data set is important, since it enables users to understand the data better. One prominent way of assessment is computing statistics about the instances and schema of a data set. However, computing statistics over large RDF data is computationally expensive. To overcome this, we previously built DistLODStats, a framework for the parallel computation of 32 statistical criteria over large RDF datasets, based on Apache Spark. Running DistLODStats is thus done by submitting jobs to a Spark cluster. Oftentimes, this is done manually, either by connecting to the cluster machine or via a dedicated resource manager. This approach is inconvenient, as it requires acquiring new software skills as well as direct interaction of users with the cluster. To make the use of DistLODStats easier, we propose in this paper an approach for triggering RDF statistics calculation remotely, simply using HTTP requests. DistLODStats is built as a plugin into the larger SANSA Framework and makes use of Apache Livy, a novel lightweight solution for interacting with a Spark cluster via a REST interface.

1 Introduction

SANSA [3] is an open source framework⁴ that allows RDF processing at scale.
It provides a set of libraries for executing SPARQL queries, performing inference, and running analytics over knowledge graphs, all while supporting several RDF representations. In addition, it provides support for RDF dataset statistics and quality assessment over large-scale RDF datasets. The statistics are calculated using the dedicated component DistLODStats [4], a distributed and scalable software able to compute 32 statistical criteria (initially proposed in [1]).

SANSA and DistLODStats use Apache Spark⁵ as an underlying engine, a popular framework for processing large datasets in-memory. Spark provides two possibilities for running and interacting with applications:

– Interactive – via a command line interface (CLI) called Spark Shell, or via Spark Notebooks (e.g. SANSA-Notebooks [2]),
– Batch – which includes a bash script called spark-submit, used to submit a Spark application to the cluster without interaction during run time.

⁴ https://github.com/SANSA-Stack
⁵ http://spark.apache.org/

A Spark application is usually launched by first logging into a cluster, either on premises or remotely in the cloud. This process presents several difficulties:

– It requires sophisticated user access control management, which may become hard to maintain with multiple users.
– It raises the chances of exhausting the cluster or even causing its failure.
– It exposes the cluster and its configurations to all users with access.

In order to alleviate these issues, we have investigated Apache Livy⁶ – a novel open source REST interface for interacting remotely with Apache Spark. It supports executing snippets of code or programs in a Spark context that runs locally, in a Spark cluster, or in Apache Hadoop YARN.

This is an accompanying poster paper for DistLODStats [4], which was accepted at the ISWC resource track.
The addition made in this poster is an interactive REST API for DistLODStats, which enables calculating RDF dataset statistics remotely, i.e., without direct contact with the hosting cluster.

2 STATisfy: A REST Interface for DistLODStats

Traditionally, a Spark job is submitted to a Spark cluster via spark-shell or spark-submit. Usually, this is done manually, either by entering the cluster gateway machines or via a dedicated resource manager (e.g. SLURM, OpenStack).

Fig. 1. STATisfy overview architecture.

⁶ https://livy.incubator.apache.org/

For users with little experience in cluster management and the Hadoop infrastructure, it can be challenging to run Spark. As an alternative, we introduce STATisfy⁷: a REST interface for DistLODStats. Instead of computing RDF statistics directly on the cluster, the interaction is done via REST APIs (as depicted in Figure 1).

Fig. 2. STATisfy used on the LODStats system.

The client side creates a remote Spark context for initialization and submits jobs through REST APIs. The Livy REST server then picks up the job and sends it through a remote procedure call (RPC) to a SparkSession, where the code is initialized and executed. In the meantime, the client waits for the result of the job to come back along the same path.

Running STATisfy is similar to using DistLODStats via spark-submit. The difference is that the shell is not running locally; instead, it runs in a cluster and transfers the data back and forth through the network.

To demonstrate the usage of the tool, we have deployed it on the comprehensive statistics catalogue LODStats⁸, which crawls RDF data from metadata portals such as the CKAN dataset metadata registry. By doing so, it obtains a comprehensive picture of the current state of the Web of Data.
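The client-side flow described above – submit a batch via Livy's POST /batches endpoint, then poll its state until the job finishes – can be sketched in Python using only the standard library. This is a minimal sketch, not part of STATisfy itself: the host, jar path, and main class are the example values used in this paper, and a real deployment would substitute its own.

```python
import json
import time
import urllib.request

LIVY_URL = "http://sansa-stack.net:8998"  # the paper's example Livy endpoint

def batch_payload(jar, class_name, args):
    """Build the spark-submit-style JSON body expected by Livy's POST /batches."""
    return {
        "file": jar,
        "className": class_name,
        "name": "SANSA RDF Dataset Statistics",
        "executorCores": 1,
        "executorMemory": "512m",
        "driverCores": 1,
        "driverMemory": "512m",
        "args": args,
    }

def submit_batch(payload):
    """POST the payload to Livy; returns the created batch (has 'id' and 'state')."""
    req = urllib.request.Request(
        f"{LIVY_URL}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def wait_for_batch(batch_id, poll_seconds=10):
    """Poll GET /batches/{id}/state until the job leaves the running states."""
    while True:
        with urllib.request.urlopen(f"{LIVY_URL}/batches/{batch_id}/state") as resp:
            state = json.load(resp)["state"]
        if state not in ("starting", "running"):
            return state  # e.g. "success", "dead" or "killed"
        time.sleep(poll_seconds)

# Example (requires a reachable Livy server, so not executed here):
# batch = submit_batch(batch_payload(
#     "hdfs:///tmp/REST/sansa-rdf-stats.jar",
#     "net.sansa_stack.examples.spark.rdf.RDFStats",
#     ["-i hdfs:///input.nt"]))
# print(wait_for_batch(batch["id"]))
```

Livy additionally exposes GET /batches/{id}/log for inspecting the driver log and DELETE /batches/{id} for killing a running job; the payload fields above mirror the usual spark-submit options.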
Since we use DistLODStats as the underlying engine for computing the RDF statistics, the limitation was that the user had to interact with the cluster manually and initiate the job computing those statistics. By using the STATisfy REST interface, LODStats can interact with the cluster from anywhere, which provides the capabilities necessary to do this without compromising ease of use or security. As shown in Figure 2, the user starts a session via the REST API, using Livy to submit a job to the Spark cluster.

⁷ https://github.com/GezimSejdiu/STATisfy
⁸ http://lodstats.aksw.org/

Listing 1.1. DistLODStats example REST call.

curl -X POST -H "Content-Type: application/json" sansa-stack.net:8998/batches --data '{
  "file": "hdfs:///tmp/REST/sansa-rdf-stats.jar",
  "className": "net.sansa_stack.examples.spark.rdf.RDFStats",
  "name": "SANSA RDF Dataset Statistics",
  "executorCores": 1,
  "executorMemory": "512m",
  "driverCores": 1,
  "driverMemory": "512m",
  "args": ["-i hdfs:///input.nt"]
}'

The script (see Listing 1.1) contains the spark-submit configuration, given as a JSON structure carrying the same information as a spark-submit invocation. With the request POST /batches, a user can submit a job to DistLODStats through the Livy server. Using Livy, STATisfy then launches this request on the cluster. As a result, the output is returned to the caller in the form of a VoID description.

3 Conclusions

In order to deepen their understanding of the data, many users require gathering statistical information about RDF datasets. This process becomes compute-intensive when the datasets grow in size. DistLODStats is a prominent solution; however, it requires setting up and managing the cluster configuration and job submission. To make the process easier, we have introduced STATisfy, a tool for interacting with DistLODStats via a REST interface.
In this way, DistLODStats can be provided as-a-service, where users only send (HTTP) requests to the remote cluster and obtain the desired results, without requiring any knowledge about system access or cluster management. STATisfy is used in the LODStats project, and its inclusion in the new DBpedia⁹ community release process is ongoing.

References

1. J. Demter, S. Auer, M. Martin, and J. Lehmann. LODStats – An Extensible Framework for High-Performance Dataset Analytics. In Proceedings of EKAW 2012, Lecture Notes in Computer Science (LNCS) 7603. Springer, 2012.
2. I. Ermilov, J. Lehmann, G. Sejdiu, L. Bühmann, P. Westphal, C. Stadler, S. Bin, N. Chakraborty, H. Petzka, M. Saleem, A.-C. N. Ngonga, and H. Jabeen. The Tale of Sansa Spark. In Proceedings of the 16th International Semantic Web Conference, Posters & Demos, 2017.
3. J. Lehmann, G. Sejdiu, L. Bühmann, P. Westphal, C. Stadler, I. Ermilov, S. Bin, N. Chakraborty, M. Saleem, A.-C. N. Ngonga, and H. Jabeen. Distributed Semantic Analytics using the SANSA Stack. In Proceedings of the 16th International Semantic Web Conference, 2017.
4. G. Sejdiu, I. Ermilov, J. Lehmann, and M. Nadjib-Mami. DistLODStats: Distributed Computation of RDF Dataset Statistics. In Proceedings of the 17th International Semantic Web Conference, 2018.

⁹ https://wiki.dbpedia.org/