<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">RDF-Connect: A declarative framework for streaming and cross-environment data processing pipelines</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Arthur</forename><surname>Vercruysse</surname></persName>
							<email>arthur.vercruysse@ugent.be</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Electronics and Information Systems</orgName>
								<orgName type="laboratory">IDLab</orgName>
								<orgName type="institution">Ghent University -imec</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jens</forename><surname>Pots</surname></persName>
							<email>jens.pots@ugent.be</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Electronics and Information Systems</orgName>
								<orgName type="laboratory">IDLab</orgName>
								<orgName type="institution">Ghent University -imec</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Julián</forename><surname>Rojas Meléndez</surname></persName>
							<email>julianandres.rojasmelendez@ugent.be</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Electronics and Information Systems</orgName>
								<orgName type="laboratory">IDLab</orgName>
								<orgName type="institution">Ghent University -imec</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pieter</forename><surname>Colpaert</surname></persName>
							<email>pieter.colpaert@ugent.be</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Electronics and Information Systems</orgName>
								<orgName type="laboratory">IDLab</orgName>
								<orgName type="institution">Ghent University -imec</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">RDF-Connect: A declarative framework for streaming and cross-environment data processing pipelines</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">21384E4C75678CD145E2F2B0369B4E4E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:47+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Data pipeline</term>
					<term>RDF</term>
					<term>Streaming</term>
					<term>Interoperability</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Data processing pipelines are a crucial component of any data-centric system today. Machine learning, data integration, and knowledge graph publishing are examples where data processing pipelines are needed. Furthermore, most production systems require data pipelines that support continuous operation and streaming-based capabilities for low-latency computations over large volumes of data. However, the creation and maintenance of data processing pipelines are challenging, and much effort is usually spent on ad-hoc scripting, which limits reusability across systems. Existing solutions are not interoperable out-of-the-box and do not allow for easy integration of different execution environments (e.g., Java, Python, JavaScript, Rust, etc.) while maintaining streaming operation. For example, combining Python, JavaScript and Java-based libraries natively in a single pipeline is not straightforward. An interoperable and declarative mechanism could allow for continuous communication and integrated execution of data processing functions across different execution environments. We introduce RDF-Connect, a declarative framework based on semantic standards that enables instantiating pipelines whose data processing functions run across execution environments and communicate through well-known protocols. We describe its architecture and demonstrate its use for an RDF knowledge graph creation, validation and publishing use case. The declarative nature of our approach facilitates the reusability and maintainability of data processing pipelines. We currently support JavaScript and JVM-based environments, but we aim to extend RDF-Connect to other rich ecosystems such as Python and to lower-level languages such as Rust, to take advantage of system-level performance gains.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Modern data-centric software systems are built on complex data processing operations. The design and implementation of these operations are organized into data pipelines: sequences of data processing components where the output of one component is the input of the next, thus enabling smooth data flows towards a common goal <ref type="bibr" target="#b0">[1]</ref>. Data pipelines are at the core of data-dependent tasks such as data science, where pipelines are used for the acquisition, curation and analysis of data <ref type="bibr" target="#b1">[2]</ref>; machine learning (ML), where pipelines support the preparation, training, validation and cleaning of data <ref type="bibr" target="#b2">[3]</ref>; and knowledge graph construction and publishing, where pipelines are used to generate semantic annotations from heterogeneous sources, validate data (e.g., with SHACL) and load it into a graph database <ref type="bibr" target="#b3">[4]</ref>.</p><p>Recently, the need for streaming data pipelines has increased due to the demand for low-latency computations. Production systems often need to process high volumes of data at high velocity, rendering traditional batch processing methods insufficient <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. Batch processing systems suffer from latency issues since they need to collect input data into batches before any further processing can take place <ref type="bibr" target="#b6">[7]</ref>. Streaming data pipelines are designed to continuously process data as soon as it arrives, which is required for (near) real-time analytics, monitoring and reacting to changes in data sources. 
Internet of Things (IoT) scenarios are a typical example where streaming data pipelines are essential <ref type="bibr" target="#b7">[8]</ref>.</p><p>Despite how much data pipelines can aid with data management challenges through automation, monitoring, fault detection, etc., modeling and implementing data pipelines often requires significant effort and expertise <ref type="bibr" target="#b0">[1]</ref>. Moreover, data pipelines are often implemented as ad-hoc scripts <ref type="bibr" target="#b8">[9]</ref>, which are error-prone and difficult to maintain and reuse across different systems <ref type="bibr" target="#b9">[10]</ref>. A myriad of data pipeline frameworks exist, each with its own strengths and weaknesses <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. However, most of these frameworks are not interoperable out-of-the-box (or only provide support for a few specific others) and do not allow for easy integration of different execution environments. For instance, combining Python, JavaScript and Java-based data processing libraries natively (i.e., each executed in its own native environment) in a single pipeline is not straightforward. This is particularly important, for example, in the case of RDF knowledge graph processing pipelines, where despite the maturation of RDF libraries (e.g., Apache Jena, RDFLib, RDF-JS-based libraries, etc.), robust implementations of well-defined operations, like RDF generation or SHACL validation, are still scarce and only available for a few languages. This raises the need to connect the best implementations for each operation seamlessly; while shell scripts can facilitate this, they are difficult to maintain and reuse.</p><p>The need for multilingual pipelines also arises when certain critical parts of a pipeline demand substantial resources. 
A pipeline should manage the required load efficiently. If a component is not efficient, it should be replaceable with a more performant version in a different language. The challenge lies not in handling the data across different programming languages but in connecting these languages effectively. Conversely, not all pipeline components are critical, and some may adhere to different standards. Interacting with a complex internal API to aggregate data may be more easily implemented using a company's primary languages, leveraging existing SDKs and expertise. In general, a pipeline should not mandate a particular language for specific tasks.</p><p>Designing an interoperable and declarative framework that allows for easy description, continuous communication and integrated execution of data processing functions across different execution environments is the main goal of this work. The framework should allow for a clear separation of concerns (i.e., high-level workflow definition from step-level implementation and deployment details) while focusing on the reusability of data pipelines <ref type="bibr" target="#b10">[11]</ref>, which could allow for the deployment of new and similar data pipelines with minimal effort and adjustments. We also aim to facilitate the testing, benchmarking and comparison of different data processing tools implemented in different languages.</p><p>With that goal in mind, we set out to create a declarative streaming pipeline architecture that addresses the previously mentioned challenges.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Streaming:</head><p>The architecture should focus on streaming pipelines, allowing for continuous data processing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Multi-lingual:</head><p>The architecture should not be built around a single language and should abstract away language specifics.</p><p>In this paper, we introduce RDF-Connect, a declarative, multilingual and streaming pipeline framework. The framework defines a simple RDF vocabulary and architecture that abstracts languages by using language-specific Runners, which are able to communicate over language-agnostic Channels. These channels enable the streaming of data as messages, facilitating the construction of streaming pipelines. The pipeline and its components (Processors) are described using SHACL shapes, which allow declaring the expected input and output data types and constraints of each processor. To demonstrate the fulfillment of the proposed requirements, we describe a real use case, which consists of a pipeline that annotates, validates and loads incoming sensor data into an RDF triple store. Currently, RDF-Connect supports JavaScript and JVM-based environments. In the future, we aim to extend it to support rich environments such as Python and lower-level languages such as Rust, to take advantage of system-level performance gains.</p><p>The remainder of the paper is structured as follows. First, we present an overview of related work, noting the ubiquity of pipelining frameworks. Next, we describe RDF-Connect, discussing various design decisions and the provision of multiple runners to support multilingual pipelines. The subsequent section elaborates on an existing pipeline used in the field, addressing some of the hurdles that were overcome. The final sections cover the conclusion and future directions.</p></div>
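To illustrate the idea of describing a processor's parameters with a SHACL shape, the following is a minimal sketch. All names here (the ex: prefix, the Append class and its suffix parameter) are hypothetical illustrations of the approach, not part of the published RDF-Connect vocabulary.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

# Hypothetical shape: an "Append" processor instance must declare
# exactly one string-valued suffix parameter.
ex:AppendShape a sh:NodeShape ;
  sh:targetClass ex:Append ;
  sh:property [
    sh:path ex:suffix ;
    sh:datatype xsd:string ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
  ] .
```

A SHACL engine can check every processor instance in a pipeline configuration against such a shape before execution, which is how a pipeline can be validated before it runs.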
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>In this section, we present an overview of related work that addresses the requirements, characteristics, and challenges of data processing pipelines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Data pipelines</head><p>Foidl et al. <ref type="bibr" target="#b9">[10]</ref> provide a comprehensive overview of data pipelines, starting with a definition: a data processing pipeline can be understood, from a theoretical perspective, as a directed acyclic graph (DAG) composed of a sequence of nodes that process data; and, from a practical perspective, as a piece of software that automates the manipulation of data and moves it from diverse source systems to defined destinations <ref type="bibr" target="#b9">[10]</ref>. The authors also define a generic architecture for data pipelines, which consists of three main components:</p><p>1. Data Source: The origin of the data, which can be a database, a file, a message queue, or a Web API. 2. Data Processing: The transformation of the data, which can include filtering, aggregation, enrichment, and validation. 3. Data Sink: The destination of the data, which can be other pipelines, an application or external storage systems.</p><p>Lastly, they mention a classification of data pipelines based on their processing characteristics, such as processing mode (batch or streaming), data flow (ETL (Extract-Transform-Load) or ELT (Extract-Load-Transform)), and use case (visualization, analysis tools, ML and deep learning applications, or data mining) <ref type="bibr" target="#b9">[10]</ref>.</p><p>Similarly, Matskin et al. <ref type="bibr" target="#b10">[11]</ref> provide a survey of big data pipeline orchestration tools. They propose a set of criteria for evaluating pipeline tools and frameworks, emphasizing reusability, flexible pipeline communication modes, and separation of concerns. The authors use these criteria to analyze 37 different tools for big data pipeline orchestration. 
They also note that few tools support a graphical input language for the description of pipelines, and they conclude that Apache Airflow (open-source), Argo Workflow and Snakemake (closed-source) are the frameworks best suited to their needs, despite Airflow failing to fulfil the separation of concerns criterion and lacking support for streaming processing <ref type="bibr" target="#b10">[11]</ref>. Among the criteria, the authors do not directly consider support for multilingual pipelines; however, step-level containerization could be seen as a way to support different languages. Thanks to its language-oriented Runners, RDF-Connect can support native multilingual execution without resorting to containerization, although pipeline-level containerization can be used to facilitate dependency management.</p><p>Mbata et al. <ref type="bibr" target="#b11">[12]</ref> also provide a survey of pipeline tools for data engineering. In this survey, pipelines are also categorised according to their characteristics, such as (i) ETL/ELT pipelines, (ii) data integration, ingestion and transformation pipelines, (iii) orchestration and workflow management pipelines, and (iv) ML pipelines. Apache Spark-based pipelines are highlighted as a popular choice for ETL/ELT data processing with support for Scala, Java, Python and R, although their performance is compromised when used for large pipelines. Apache Kafka and Apache NiFi are mentioned as streaming integration tools. Kafka provides interfaces for multiple languages and is streaming oriented but entails a steep learning curve. NiFi is user-friendly and supports live streaming pipelines, but is mainly Java-based. For orchestration and workflow management, Apache Airflow and Apache Beam are highlighted. Apache Airflow and Apache Beam primarily concentrate on running pipelines defined in a single language across different execution environments, with a particular emphasis on big data. 
Although Beam supports multilingual pipelines, its implementation is highly verbose. Beam relies on job servers, which have to be set up before starting the pipeline, to execute parts written in a different language. The connections to these job servers are complex to code and challenging to modify. On the other hand, Apache Airflow supports neither declarative pipelines, streaming data, nor multilingual pipelines (being mainly Python-oriented).</p><p>Related works focused on streaming pipelines include the work of Isah et al. <ref type="bibr" target="#b6">[7]</ref>, which surveys distributed data stream processing frameworks such as Apache Flink, Apache Storm, Apache Spark Streams, and Apache Kafka Streams. They define a taxonomy for these frameworks based on their characteristics, for instance the programming model (native or micro-batch) and the type of transformations supported per record (Map, Filter and FlatMap). RDF-Connect may be considered a native data stream processing framework with support for all types of per-record transformations. The work of Dias et al. <ref type="bibr" target="#b13">[14]</ref> also provides a survey of stream processing frameworks with a focus on resource elasticity (i.e., the capacity to automatically scale resources based on the workload). The authors highlight the need for high-level abstractions to facilitate the development of stream processing applications.</p><p>Some works relate to workflow standards, including the work of Dehury et al. <ref type="bibr" target="#b14">[15]</ref>, which introduces a data pipeline architecture for serverless platforms, based on the OASIS TOSCA (Topology and Orchestration Specification for Cloud Applications) standard <ref type="foot" target="#foot_0">1</ref> . The authors implement the proposed architecture using Apache NiFi components. 
The Common Workflow Language (CWL) is an open standard designed to describe the execution of command-line tools and their integration into workflows <ref type="foot" target="#foot_1">2</ref> . Its main components are command-line tools, which are essentially wrappers around commands (such as ls, echo, tar, etc.). CWL allows for the chaining of these tools to form workflows. Workflows themselves can also have inputs and outputs and can be nested within each other. The primary strength of CWL lies in its ability to chain the output of one tool to the input of another. Workflows described in CWL can be executed using different CWL runners. These range from the most basic cwltool<ref type="foot" target="#foot_2">3</ref> , which runs workflows locally, to more advanced distributed computing platforms such as AWS and Azure via Arvados<ref type="foot" target="#foot_3">4</ref> and others <ref type="foot" target="#foot_4">5</ref> .</p><p>Other related works include the work of Dessalk et al. <ref type="bibr" target="#b15">[16]</ref>, which proposes and implements an approach for big data workflows based on Docker containers that communicate using the message-oriented middleware KubeMQ. The authors also define a Domain Specific Language (DSL) for the description of workflows, although it seems not to be used in their implementation. In principle, this approach could be used to support multilingual pipelines; however, the authors only implement individual data processing tasks using bash scripts. Lastly, Agrawal et al. <ref type="bibr" target="#b16">[17]</ref> introduce RHEEM, a cross-platform data processing framework that decouples applications from underlying platforms. This tool allows defining data processing tasks from a limited number of mapping operators, which are then split and assigned to different execution environments based on a cost model. 
Although this approach does not qualify as a general pipeline framework, it provides an interesting approach for multilingual and cross-platform data processing. A continuation of this work is currently being developed as an incubating Apache project called Apache Wayang <ref type="bibr" target="#b17">[18]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Semantic-related data pipelines</head><p>Linked Pipes <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20]</ref> (LP) was developed to provide a more user-friendly approach to creating Linked Open Data pipelines. Its primary objectives are to extract data, transform it into Linked Data, and load it into RDF triple stores. The tool features a web-based GUI that facilitates the easy construction and debugging of pipelines. Pipelines and processors in Linked Pipes are defined and configured using RDF. Each pipeline has a dereferenceable IRI, allowing it, for example, to be imported into another LP-ETL instance. Linked Pipes is also built on the JVM, and all processors are implemented in JVM languages. While extending processors is possible, it demands considerable effort. Engineers may also find it challenging to adapt the generated RDF configurations for processors. Linked Pipes emphasizes the transportation of Linked Data between processors. It includes predefined processors such as extractors, transformers and loaders, as well as quality assessment and special processors. Quality assessment processors ensure that messages are processed correctly and can halt the pipeline if issues are detected. Special processors can execute remote commands, adding to the system's flexibility. The main drawbacks of Linked Pipes are its lack of support for multilingual pipelines and its batch-based processing paradigm.</p><p>Grassi et al. <ref type="bibr" target="#b20">[21]</ref> introduced Chimera, a semantic data transformation and integration tool based on Apache Camel. Camel is a Java-based integration framework that provides a set of predefined components that can be extended for custom tasks. Camel can support multilingual pipelines through WebAssembly, which adds an additional layer of complexity. 
Although Camel can also support streaming-based operations, the authors relied on batch processing for Chimera. Guash et al. <ref type="bibr" target="#b21">[22]</ref> presented a pipeline for RDF knowledge graph construction that uses Apache Airflow as a workflow orchestrator together with Celery <ref type="foot" target="#foot_5">6</ref> . The authors argue in favor of using Airflow due to its scalability, the capacity to define tasks in code, and its open-source nature. Another example of a semantic framework for pipelines is UnifiedViews <ref type="bibr" target="#b22">[23]</ref>, a batch-based ETL framework with native support for RDF data, which allows wrapping existing Java libraries as data-processing units and composing pipelines using a graphical user interface. The open source version of UnifiedViews seems not to be maintained anymore, in favor of a commercial version integrated into the PoolParty Semantic Suite <ref type="foot" target="#foot_6">7</ref> .</p><p>Bonte et al. <ref type="bibr" target="#b23">[24]</ref> presented a survey of stream reasoning systems and specified a lifecycle model for Streaming Linked Data. The lifecycle model consists of seven stages:</p><p>1. Name: Identify data streams with an IRI. 2. Model: Use a data model that accounts for both data and metadata. 3. Shape: Define the smallest unit of data in a stream (e.g., triple-based or graph-based). 4. Annotate: Transform non-RDF data into RDF. 5. Describe: Include interoperable metadata for discovery purposes. 6. Serve: Define the format and protocol for data sharing. 7. Query: Consume the data stream via a querying process.</p><p>Although this work does not provide a specific pipeline framework, the lifecycle model can be used to define a set of processors for a streaming pipeline. Klironomos et al. <ref type="bibr" target="#b24">[25]</ref> introduced ExeKGLib, a Python library to build ML pipelines described as a knowledge graph. 
The library is able to generate the corresponding Python scripts from the knowledge graph description, based on a set of data science operations (e.g., feature engineering, visualization, model training, etc.). Pipeline executions are done in a batch-based manner and, so far, only Python-based operations are supported. Sicilia et al. <ref type="bibr" target="#b26">[26]</ref> introduced an ontology that attaches semantic descriptions to data and analytic transformations that are integrated with data pipeline code. Concretely, they apply this approach to data pipelines from the Apache Beam framework. Lastly, Chakraborty et al. <ref type="bibr" target="#b27">[27]</ref> present a survey of ETL tools for semantic data integration. The survey concludes with a set of open challenges for the field, such as the need for automation and visualization tools.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">RDF-Connect</head><p>RDF-Connect defines a declarative and streaming pipeline framework for cross-environment data processing. It aims to enhance interoperability among data processing libraries that excel in specific tasks. This necessity is particularly prominent within the Linked Data ecosystem due to the limited number of robust implementations. RDF libraries are primarily written in three major programming languages: JavaScript, Java, and Python. Developing software at a high TRL (Technology Readiness Level) demands significant effort. To advance the RDF ecosystem, it is inefficient to reimplement the same functionality across all programming languages. Instead, the best implementations should be able to interoperate within a single pipeline. RDF-Connect addresses this interoperability challenge by defining a common vocabulary to describe data processing units and their interactions, and by implementing abstraction layers per execution environment that conceal the underlying programming language specifics of a given data processing unit.</p><p>For instance, consider SHACL validation as an example. The task is to determine whether a given RDF graph conforms to a given SHACL shape. While the SHACL specification clearly defines the requirements for a SHACL engine implementation, achieving full specification coverage can be challenging when developing an engine from scratch. Performance is another critical factor: the validation step must not become a bottleneck in the pipeline, which is a risk if a validator is implemented in a slower programming language. With RDF-Connect, only a single high-TRL SHACL validation implementation is necessary, making high-performance validation possible across different pipelines.</p><p>Another example of the value of RDF-Connect is evident when integrating a pipeline into an organisation's infrastructure and some custom code is necessary. 
This code, being specific to the organisation, is unlikely to be reused by others and can thus be written in a language that facilitates rapid development and suits the developers' expertise. RDF-Connect mitigates the tension between application speed and development speed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">General idea</head><p>RDF-Connect focuses on three main objectives:</p><p>1. The choice of programming language should not impair the pipeline. 2. New components should be easy to build. 3. Pipelines should be able to be validated before running.</p><p>To achieve these objectives, RDF-Connect is divided into four components:</p><p>1. Processors: These are small execution units that perform a single and usually simple task to foster reusability. 2. Runners: Runners target a specific execution environment (like JavaScript) in which they initiate one or more processors with the provided pipeline configuration. They also construct the configured channels before starting the pipeline. 3. Channels: These are abstract entities responsible for data transfers among processors. 4. Pipeline Configuration: An RDF document describing various interlinked processor instances working together to accomplish a task.</p><p>Figure <ref type="figure" target="#fig_0">1</ref> shows the main concepts of the RDF-Connect data model, published online at https://w3id.org/rdf-connect. Language-specific Processors are executed by language-specific Runners, which support specific types of streaming communication Channels. SHACL shapes 8 accompany concrete processor and channel definitions, providing descriptions of input and output parameters and their constraints. The vocabulary can be easily extended to define coverage for additional execution environments and communication means. Programming language barriers are overcome by splitting logic and sending/receiving data over channels. Channels are programming language-independent and based on well-known communication protocols, such as HTTP request/response interactions, since most languages provide support for sending HTTP messages and starting an HTTP endpoint. 
However, channel logic is abstracted away from processors, so that each processor acts on event-driven messages regardless of how those messages are delivered. Each type of channel has its advantages and disadvantages. HTTP channels are easy to set up and relatively fast, but lack replay-ability and fault tolerance. Kafka streams, for example, offer replay-ability and logging, providing a more feature-rich type of channel to connect processors. For consecutive processors written in the same language, in-memory channels can be used to share data efficiently. Note that this does not compromise the first objective; all pipelines remain configurable, although some configurations may involve performance trade-offs. As mentioned before, processors are channel-agnostic to ensure maximum reusability, allowing the pipeline manager to choose the optimal channel for each step. A complete list of supported channels is described in Section 3.2.</p><p>Creating a new processor is language-dependent and should be approached from the corresponding runner's perspective. For instance, to run a JavaScript processor, only the JavaScript file and the processor function are required. This simplicity facilitates the creation of new processors, as each processor merely needs to describe the file and the function, and annotate the required parameters. This information ensures consistent and reliable initiation of the processor. Each processor should be accompanied by a SHACL shape, which runners can use to determine the processor's parameters. SHACL shapes enable the validation of a pipeline before it is executed. Additionally, these shapes and processor configurations allow for the derivation of provenance information from pipelines.</p><p>A pipeline is runner-agnostic, which makes it possible to refer to a pipeline without knowing the underlying implementation or the underlying runners. 
All information required for the runner to start a processor is already present in the processor definition. Listing 1 shows a pipeline with three processors of type rdfc-js:Send, rdfc-js:Append and rdfc-jvm:Print. The purpose of the pipeline is to send the message "Hello" from the first processor to the second, which appends "World" to the message and sends it to the third processor, which in turn prints the message to the console. This pipeline should be started with the JVM and the JavaScript runner. The runners expand all owl:imports statements to find the required processor definition files, both locally and remotely. Each runner starts all processors that are defined against that runner. When everything is started, the application runs.</p></div>
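The channel-agnostic processor model described above can be sketched in a few lines of JavaScript. This is our own minimal illustration of the idea, not the actual RDF-Connect runner API: a channel pairs an event-driven reader with a writer, and processors only ever see those two interfaces, regardless of how messages travel.

```javascript
// Minimal sketch of channel-agnostic processors, mirroring the
// Send -> Append -> Print pipeline described in the text. The channel
// and processor interfaces are our own illustration, not the runner API.
function createChannel() {
  const handlers = [];
  return {
    // Reader side: a processor subscribes to incoming messages.
    reader: { on: (handler) => handlers.push(handler) },
    // Writer side: pushing a message notifies every subscriber.
    writer: { push: async (msg) => { for (const h of handlers) await h(msg); } },
  };
}

// Three toy processors: send emits a message, append adds a suffix,
// print hands the message to an output callback.
const send = (writer, msg) => writer.push(msg);
const append = (reader, writer, suffix) =>
  reader.on((msg) => writer.push(msg + suffix));
const print = (reader, out) => reader.on((msg) => out(msg));
```

Wiring `send -> append -> print` over two such channels delivers "Hello World" to the output callback; in RDF-Connect, the wiring comes from the declarative pipeline configuration rather than from code.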
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">JavaScript runner, processors and channels</head><p>As mentioned before, a runner is the abstraction layer between the executing processors and the complete pipeline that constitutes the application. It handles parsing arguments, setting up channels and providing useful feedback when the execution fails. We implemented a JavaScript runner that can execute a JavaScript function as a processor. The code is open source and available under an MIT license 9 .</p><p>This was our first runner and gave us initial experience in understanding which channels were useful. We implemented the following channels for the JavaScript runner:</p><p>• HTTP: The HTTP channel is the main channel used for crossing language barriers. The only configuration the HTTP channel needs is a port and a host to start the endpoint. Once the channel is set up, messages can cross programming language barriers. • WebSocket: The WebSocket channel is very similar to the HTTP channel but may improve throughput slightly. Note that WebSockets are bidirectional, but the reverse direction is not used, as this would break channel-agnostic implementations. • File: The file channel enables communication over files. Writing to this channel writes a file, and the reader side of the channel can use a file watcher to detect changes. 9 https://github.com/rdf-connect/js-runner/tree/rdf-connect-paper Listing 1: Example pipeline that defines three processors. The processors of type rdfc-js:Send and rdfc-js:Append communicate with each other using a JavaScript in-memory channel (rdfc-js:JSReaderChannel and rdfc-js:JSWriterChannel), which suggests that the processors are implemented in JavaScript. The third processor of type rdfc-jvm:Print communicates with the processor of type rdfc-js:Append using an HTTP channel, which sends data via an HTTP POST. 
The processors' descriptions are found in the external documents imported by the owl:imports property.</p><p>A change is interpreted as a new message. This channel is particularly useful for integrating configuration files into RDF-Connect in an idiomatic way. • Kafka: The Kafka channel is the only channel that offers replay-ability and fault tolerance. If some part of the pipeline breaks, the messages are not lost but still exist on the Kafka topic created by the channel. It also facilitates integrating a pipeline with existing Kafka streams. • In-memory: This channel is the main channel used to let JavaScript processors communicate with each other. It is very fast and easy to use, as messages never leave the JavaScript heap, avoiding data copies.</p><p>The JavaScript runner requires all processors to denote the location of their implementation file and their function name. This gives enough information to import the file and call the function at runtime. Each processor also comes with a SHACL shape denoting the required parameters. In previous iterations, we used the Function Ontology (FnO) parameter mappings for this, but they were found too verbose to write and have been deprecated (although some processor descriptions may still use them). Now each processor gets a single argument built with RDF Lens<ref type="foot" target="#foot_7">10</ref> derived from the SHACL shape. RDF Lens gives great flexibility in the objects that are used as arguments. An example of a processor definition for the JavaScript runner can be found in Listing 2; the corresponding implementation is shown in Listing 3. The JavaScript runner works in three steps. First, it loads the pipeline to determine all configured JavaScript processors and tries to extract their arguments. Then, each processor is started one by one. 
Each processor can return a function that is called once all processors are started; this alleviates race conditions and allows for particular start-up sequences, such as connecting to a database. Finally, all functions returned by processors are called and the pipeline starts its execution.</p></div>
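The three-step start-up described above can be sketched as follows. The processor functions and the `args.log` field are hypothetical, purely for illustration; this is not the actual js-runner API:

```javascript
// Step 1: the runner has "loaded" the configured processors (plain functions here).
const processors = [
  (args) => {
    args.log.push("send: set up");
    // Returning a function defers the actual work until every processor is ready.
    return () => args.log.push("send: started");
  },
  (args) => {
    args.log.push("print: set up"); // no start-up callback needed
  },
];

// Steps 2 and 3: start processors one by one, then invoke the returned callbacks.
async function startPipeline(processors, args) {
  const callbacks = [];
  for (const proc of processors) {
    const cb = await proc(args); // step 2: set-up phase, one processor at a time
    if (typeof cb === "function") callbacks.push(cb);
  }
  // Step 3: only once every processor is set up are the callbacks invoked,
  // so no producer starts pushing before its consumer is listening.
  await Promise.all(callbacks.map((cb) => cb()));
  return args.log;
}
```

Calling `startPipeline(processors, { log: [] })` resolves to `["send: set up", "print: set up", "send: started"]`: both set-up phases complete before any processor starts its work.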
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Targeting the JVM</head><p>The Java ecosystem for RDF contains many mature, performant and feature-rich RDF applications and libraries, making it a valuable target for RDF-Connect. We set out to build a runner which bridges these applications with those targeting JavaScript, in order to greatly increase the potential for future pipelines. We have written a proof-of-concept runner using the JVM-compatible language Kotlin<ref type="foot" target="#foot_8">11</ref> , capable of executing processors written not only in Kotlin or Java, but in essentially any language that can generate JVM class files, such as Scala.</p><p>In contrast to the JavaScript runner, we do not implement our processors as a single function. Instead, we have adopted a more traditional object-oriented approach, defining a common abstract class which all processors must inherit. For example, we have delegated initialization to the extendable constructor, and execution of the main logic to the abstract exec method. Handling state and complex logic is more straightforward in this manner, since additional methods and fields can be defined as desired. Arguments are passed to the constructor as a simple string-to-object map wrapper, removing the requirement for a specific method signature. Additionally, using Kotlin's reified generic types, we provide a convenient and simple API which allows for type-safe retrieval of arguments at runtime using compiler-inferred types. Channels are implemented much like their JavaScript counterparts, but instead of relying on promises and callbacks, we make heavy use of the Kotlin Coroutines library. Most of the communication here is implemented using Kotlin's channel primitives, which closely resemble those popularized by the Go language. 
Interoperability with HTTP, Kafka, and so forth, is simply implemented as a collection of data producers and consumers of those channel primitives. As a result, the actual data source and destination are completely abstracted by a common reader and writer interface.</p><p>The design decisions above showcase how developers of new RDF-Connect runners have a large degree of freedom implementation-wise. Despite these differences, declaring processors is still done in a similar manner to the JavaScript runner, including SHACL shapes defining the required arguments. Adding a processor as part of a pipeline is identical to the JavaScript runner, as the pipeline configuration is runner-agnostic.</p><p>In its current state, coordinating the two runners must be done manually, since one runner cannot start or terminate another. Executing a pipeline spanning multiple runners typically requires a simple shell script which starts each runner individually, all pointing to the same configuration file. This, as well as other limitations and opportunities, is a key focus of our future work.</p></div>
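The common reader and writer interface that abstracts the underlying transport can be illustrated with a minimal JavaScript sketch (the JVM runner itself uses Kotlin coroutine channels; the names below are assumptions for illustration, not RDF-Connect's actual interfaces). An HTTP- or Kafka-backed channel would expose the same two methods:

```javascript
// Channel-agnostic reader/writer pair with an in-memory transport.
function inMemoryChannel() {
  const queue = [];   // messages buffered until a reader asks for them
  const waiting = []; // pending read() resolvers waiting for a message
  return {
    writer: {
      push(msg) {
        const resolve = waiting.shift();
        if (resolve) resolve(msg); // hand the message to a waiting reader
        else queue.push(msg);      // or buffer it until read() is called
      },
    },
    reader: {
      read() {
        if (queue.length > 0) return Promise.resolve(queue.shift());
        return new Promise((resolve) => waiting.push(resolve));
      },
    },
  };
}
```

Because processor code only ever touches `reader.read()` and `writer.push()`, the pipeline manager can substitute an HTTP or Kafka transport behind the same interface without changing any processor.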
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">NiFi runner: interacting with other frameworks</head><p>Apache NiFi is not a programming language, but a pipelining tool for designing automated dataflow pipelines, with a focus on the ability to operate within clusters, security using TLS encryption, and extensibility. Apache NiFi also has an API, which can be used to list processors and build pipelines. This is enough to build a NiFi runner. In a previous version of RDF-Connect, we built a NiFi runner that can generate processor descriptions based on NiFi processors retrieved with the API. These processors could be used within a pipeline just like any other processor. To instantiate those processors, the NiFi runner made the correct API calls on the NiFi cluster. Just like the JavaScript runner has in-memory channels, the NiFi runner also had NiFi channels as its main way of communication. Other channels, like HTTP, were also supported by instantiating a NiFi template that worked like an HTTP endpoint in other languages. We dropped support for NiFi in newer versions because there was no demand for it within our use cases at the time. However, the runner demonstrated feasibility, and bringing back support for it remains possible. The original code is still available on GitHub 12 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Interoperable pipeline</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Pipeline structure</head><p>A visual representation of the pipeline can be seen in Figure <ref type="figure" target="#fig_5">2</ref>. The pipeline contains multiple processors working together to achieve the goal. The blue hexagons represent incoming data, while the green hexagon represents the output. The first processor creates RML <ref type="bibr" target="#b28">[28]</ref> mappings from YARRRML <ref type="bibr" target="#b29">[29]</ref> rules, a more user-friendly notation for mappings 15 . The processor is implemented in JavaScript and relies on the yarrrml-parser library written in JavaScript 16 . The YARRRML file is sourced from a GlobRead processor, which has a single function: reading files according to a specified glob pattern 17 .</p><p>The next processor executes an RML mapping for each incoming JSON object received from The Things Network (TTN). This is a JavaScript processor that wraps around the RMLMapper JAR file. Attempts to implement this processor in Java were unsuccessful due to various difficulties using the application code as a library 18 . If the GlobRead processor finds two files, the RML mapper processor executes both mappings.</p><p>The third processor checks whether the generated RDF is valid according to KWG-SHACL 19 . This step is crucial, as invalid data should not contaminate the triple store. SHACL validation is implemented in multiple languages, and with RDF-Connect, it is easy to swap one implementation for another. This pipeline was executed twice: once with a JavaScript processor wrapping around rdf-validate-shacl 20 , and once with a Java processor wrapping around org.apache.jena.shacl 21 . The shape had to be adapted in minor ways to allow for our simplified RDF generation and for differences between the validator processors. For example, each observation is observed by a sensor, but only a stub for the sensor is generated by our simple mappings. 
This results in invalid objects, but to demonstrate the functionality, we altered the shape file.</p><p>Since the JavaScript SHACL validator processor did not exist, we developed a new processor to integrate into the pipeline. The processor is minimal, comprising only 25 lines of JavaScript code and 40 lines of RDF configuration. This experience highlights how straightforward it is to create new processors, reinforcing the practice of building small, reusable processors that perform a single task efficiently.</p><p>The final processor is again a custom processor: it creates a SPARQL INSERT query from the incoming triples, optionally places the triples in a specific graph, and executes the query on a triple store using the fetch-sparql-endpoint library.</p></div>
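The core of the final processor can be sketched as follows; the function name and the triple representation (subject, predicate and object already serialized as N-Triples terms) are assumptions for illustration, not the processor's actual code:

```javascript
// Serialize incoming triples into a SPARQL INSERT DATA query,
// optionally wrapping them in a named graph.
function buildInsertQuery(triples, graph) {
  const body = triples.map(({ s, p, o }) => `  ${s} ${p} ${o} .`).join("\n");
  return graph
    ? `INSERT DATA { GRAPH <${graph}> {\n${body}\n} }`
    : `INSERT DATA {\n${body}\n}`;
}
```

The resulting query string would then be sent to the triple store's update endpoint, for which the pipeline uses the fetch-sparql-endpoint library.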
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Mapping entities</head><p>Integrating the pipeline into the real world is accomplished using channels. New data is ingested via a POST request at the beginning of the pipeline. With The Things Network, one specifies a webhook that sends a JSON object for each measurement made by sensors. This JSON object contains information about the node that made the observation as well as a decoded payload, which is custom for each data logger node. An example of our payload is shown in Listing 5. Our YARRRML mapping file reflects this structure by generating a sosa:Observation for each field in the payload separately. To test our pipeline, we intentionally made a mistake in the mapping for humidity, creating invalid sosa:Observations. When running the pipeline, the configured triple store will not contain any humidity observations as these are rejected by the SHACL validator.</p></div>
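The per-field mapping described above can be illustrated with a small sketch: one sosa:Observation is derived from each field of the decoded payload. The property names and URI shapes are hypothetical, not the actual output of our YARRRML mapping:

```javascript
// Derive one observation descriptor per field of a decoded TTN payload.
function payloadToObservations(deviceId, decodedPayload) {
  return Object.entries(decodedPayload).map(([field, value]) => ({
    type: "sosa:Observation",
    "sosa:madeBySensor": `urn:ttn:${deviceId}:${field}`, // stubbed sensor URI
    "sosa:observedProperty": field,
    "sosa:hasSimpleResult": value,
  }));
}
```

A payload such as `{ temperature: 21.5, humidity: 60 }` thus yields two observations; with the intentionally broken humidity mapping, the humidity observation is the one the SHACL validator rejects.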
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Swapping a processor</head><p>Currently, all processors are JavaScript-based. This simplifies running the pipeline; a single command, npx js-runner pipeline.ttl, initiates and executes the pipeline. However, if we are not satisfied with the current SHACL validator (e.g., for performance reasons), it can easily be exchanged for an alternative implementation, such as the Java-based processor mentioned above.</p><p>This can be achieved by modifying the configuration to use a Java processor instead of the JavaScript processor. The Java runner cannot integrate JavaScript in-memory channels, so the incoming and outgoing channels need to be changed to, for example, HTTP channels. Both the JavaScript runner and the Java runner will establish HTTP endpoints, allowing for message transfer across the language barrier. The concept of the configuration remains unaffected by these changes.</p><p>This example, although artificial, effectively demonstrates the versatility of RDF-Connect. The ability to effortlessly swap out a processor to verify an implementation or replace it with a more performant one is advantageous. Additionally, file channels exist, which read from and write to a file. This feature is invaluable for debugging the pipeline, as it allows for immediate input and output verification by writing data to and reading it from disk.</p></div>
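Concretely, the swap amounts to a configuration change along the following lines; the processor types and channel property names are illustrative assumptions, not the actual vocabulary:

```turtle
# Before: JavaScript validator connected through in-memory channels.
# <#validate> a rdfc-js:ShaclValidate ;
#     rdfc-js:reader <#inMemIn> ;
#     rdfc-js:writer <#inMemOut> .

# After: Java validator connected through HTTP channels, so both
# runners can exchange messages across the language barrier.
<#validate> a rdfc-jvm:ShaclValidate ;
    rdfc-jvm:reader <#httpIn> ;
    rdfc-jvm:writer <#httpOut> .
```

The surrounding processors are untouched: because they are channel-agnostic, only the channel declarations connecting them to the validator change.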
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion and Future Work</head><p>In this paper we introduced our ongoing work on RDF-Connect, a declarative and streaming pipeline framework. We showed how RDF-Connect is able to abstract programming language specifics and enable communication between data processing functions, implemented in different languages, over language-agnostic channels. In this way we accomplish our goals of providing support for multi-lingual and streaming pipelines. We also showed an example of how we are applying RDF-Connect in a real use case, where we are able to load and validate incoming sensor data into an RDF triple store.</p><p>As previously mentioned, executing pipelines which span multiple runtimes and environments is currently not as simple as launching a single binary. More importantly, the development of runners targeting new runtimes and languages is also limited due to the high complexity of parsing the configuration file, setting up inter-process communication channels and routing messages. Therefore, we wish to greatly simplify RDF-Connect by designing and developing a new platform-agnostic orchestrator, which will do much of the heavy lifting currently required by all runners individually. The orchestrator should be responsible for starting the runners, either locally or remotely, and for facilitating communication between them. This leaves only one channel type, which requires no configuration, reducing the complexity of pipeline configurations. Reducing the complexity of the pipeline also reduces the potential for human errors. Finally, individual runners should not need to interpret the pipeline configurations themselves. Rather, we will map the RDF model to a simple and intuitive configuration representation at runtime in order to relieve the individual processors of building and querying their own triple store. 
By doing so, we want to make it significantly easier to bring RDF-Connect to new runtimes and environments, such as low-level languages like Rust, to take advantage of system-level performance gains.</p><p>We will also set out to define a standard way of publishing and finding processors, greatly improving convenience and ease-of-use, to facilitate an ever-growing ecosystem of processors.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Main classes of the RDF-Connect vocabulary. Two language-specific processor definitions are shown for JavaScript and JVM-based environments, for which corresponding implementations are currently available.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>1 rdfc-</head><label>1</label><figDesc>js:Resc a rdfc-js:Processor; 2 rdfc-js:file &lt;./test.js&gt;; 3 rdfc-js:function "resc"; 4 rdfc-js:location &lt;./&gt;.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Listing 2 : 4 }); 5 } 3 :</head><label>2453</label><figDesc>Example configuration of JavaScript processor including location, file, function name and required argument shape. 1 export function resc(args) { 2 args.input.data((input) =&gt; { 3 console.log(args.msg, input)Listing Example implementation of JavaScript processor.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>1 class</head><label>1</label><figDesc>Transparent(args: Arguments) : Processor(args) { 2 private val input: Reader = arguments["input"] 3 private val output: Writer = arguments["output"]</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>4 5override 7 } 8 }</head><label>478</label><figDesc>suspend fun exec() { 6 output.push(input.read())Listing 4: A simple Kotlin processor which reads data from an incoming channel and pushes it verbatim to a writer.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Visual representation of our example pipeline, loading sensor data from The Things Network into a triple store</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>1 { 2 " 7 } 5 :</head><label>1275</label><figDesc>Listing Example of the decoded payload in a TTN message</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://docs.oasis-open.org/tosca/TOSCA/v1.0/os/TOSCA-v1.0-os.pdf</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.commonwl.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://github.com/common-workflow-language/cwltool</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://arvados.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://www.commonwl.org/implementations/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">a distributed python-based task-queue system: https://github.com/celery/celery</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">https://www.poolparty.biz/poolparty-unifiedviews</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_7">https://github.com/ajuvercr/rdf-lens</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_8">The code is open source and available with an MIT license https://github.com/rdf-connect/orchestrator/tree/ rdf-connect-paper</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_9">This section covers a real-world pipeline that loads incoming sensor data into a triple store. The sensor data originates from The Things Network<ref type="bibr" target="#b12">13</ref> , a global collaborative Internet of Things ecosystem using LoRaWAN®. The pipeline and its installation instructions can be found on GitHub<ref type="bibr" target="#b13">14</ref> .12 https://github.com/ajuvercr/nifi-runner<ref type="bibr" target="#b12">13</ref> https://www.thethingsnetwork.org/ 14 https://github.com/ajuvercr/rdf-connect-paper-pipeline</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This work was supported by the MAREGRAPH project (id: 101100771) which is co-funded by the European Union under the Digital Europe Programme.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Modelling data pipelines</title>
		<author>
			<persName><forename type="first">A</forename><surname>Raj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bosch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Olsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">J</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.1109/SEAA51224.2020.00014</idno>
	</analytic>
	<monogr>
		<title level="m">2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="13" to="20" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The art and practice of data science pipelines: A comprehensive study of data science pipelines in theory, in-the-small, and in-the</title>
		<author>
			<persName><forename type="first">S</forename><surname>Biswas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wardat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rajan</surname></persName>
		</author>
		<idno type="DOI">10.1145/3510003.3510057</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 44th International Conference on Software Engineering, ICSE &apos;22</title>
				<meeting>the 44th International Conference on Software Engineering, ICSE &apos;22<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="2091" to="2103" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Data lifecycle challenges in production machine learning: A survey</title>
		<author>
			<persName><forename type="first">N</forename><surname>Polyzotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Whang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zinkevich</surname></persName>
		</author>
		<idno type="DOI">10.1145/3299887.3299891</idno>
	</analytic>
	<monogr>
		<title level="j">SIGMOD Rec</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="page" from="17" to="28" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Knowledge graph lifecycle: Building and maintaining knowledge graphs</title>
		<author>
			<persName><forename type="first">U</forename><surname>Simsek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Angele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kärle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Opdenplatz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sommer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Umbrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fensel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd International Workshop on Knowledge Graph Construction (KGC 2021) co-located with 18th Extended Semantic Web Conference</title>
				<meeting>the 2nd International Workshop on Knowledge Graph Construction (KGC 2021) co-located with 18th Extended Semantic Web Conference<address><addrLine>ESWC</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021. 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Generalizing streaming pipeline design for big data</title>
		<author>
			<persName><forename type="first">K</forename><surname>Rengarajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">K</forename><surname>Menon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Intelligence and Signal Processing</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Verma</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><forename type="middle">P</forename></persName>
		</editor>
		<meeting><address><addrLine>Singapore, Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="149" to="160" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">An overview of current trends in data ingestion and integration</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hlupić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Puniš</surname></persName>
		</author>
		<idno type="DOI">10.23919/MIPRO52101.2021.9597149</idno>
	</analytic>
	<monogr>
		<title level="m">2021 44th International Convention on Information, Communication and Electronic Technology (MIPRO)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1265" to="1270" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A survey of distributed data stream processing frameworks</title>
		<author>
			<persName><forename type="first">H</forename><surname>Isah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Abughofa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mahfuz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ajerla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zulkernine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Khan</surname></persName>
		</author>
		<idno type="DOI">10.1109/ACCESS.2019.2946884</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="154300" to="154316" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Benchmarking distributed stream processing platforms for iot applications</title>
		<author>
			<persName><forename type="first">A</forename><surname>Shukla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Simmhan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Performance Evaluation and Benchmarking. Traditional -Big Data -Internet of Things</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Nambiar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Poess</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="90" to="106" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Subtle bugs everywhere: generating documentation for data wrangling code</title>
		<author>
			<persName><forename type="first">C</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L C</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Kästner</surname></persName>
		</author>
		<idno type="DOI">10.1109/ASE51524.2021.9678520</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering, ASE &apos;21</title>
				<meeting>the 36th IEEE/ACM International Conference on Automated Software Engineering, ASE &apos;21</meeting>
		<imprint>
			<publisher>IEEE Press</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="304" to="316" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Data pipeline quality: Influencing factors, root causes of data-related issues, and processing problem areas for developers</title>
		<author>
			<persName><forename type="first">H</forename><surname>Foidl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Golendukhina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ramler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Felderer</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.jss.2023.111855</idno>
	</analytic>
	<monogr>
		<title level="j">J. Syst. Softw</title>
		<imprint>
			<biblScope unit="volume">207</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A survey of big data pipeline orchestration tools from the perspective of the datacloud project</title>
		<author>
			<persName><forename type="first">M</forename><surname>Matskin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tahmasebi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Layegh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Payberah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Thomas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roman</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.5910891</idno>
		<idno>doi:</idno>
		<ptr target="https://doi.org/10.5281/zenodo.5910891" />
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<meeting><address><addrLine>RCN)</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="63" to="78" />
		</imprint>
		<respStmt>
			<orgName>Research Council of Norway</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Mbata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sripada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhong</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2406.08335</idno>
		<title level="m">A survey of pipeline tools for data engineering</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">The big data system, components, tools, and technologies: a survey</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">R</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bhatt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Goswami</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10115-018-1248-0</idno>
	</analytic>
	<monogr>
		<title level="j">Knowledge and Information Systems</title>
		<imprint>
			<biblScope unit="volume">60</biblScope>
			<biblScope unit="page" from="1165" to="1245" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Distributed data stream processing and edge computing: A survey on resource elasticity and future directions</title>
		<author>
			<persName><forename type="first">M</forename><surname>Dias De Assunção</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Da Silva Veith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Buyya</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.jnca.2017.12.001</idno>
		<ptr target="https://doi.org/10.1016/j.jnca.2017.12.001" />
	</analytic>
	<monogr>
		<title level="j">Journal of Network and Computer Applications</title>
		<imprint>
			<biblScope unit="volume">103</biblScope>
			<biblScope unit="page" from="1" to="17" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Data pipeline architecture for serverless platform</title>
		<author>
			<persName><forename type="first">C</forename><surname>Dehury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jakovits</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">N</forename><surname>Srirama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tountopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Giotis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Software Architecture</title>
				<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="241" to="246" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Scalable execution of big data workflows using software containers</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">D</forename><surname>Dessalk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matskin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Soylu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roman</surname></persName>
		</author>
		<idno type="DOI">10.1145/3415958.3433082</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th International Conference on Management of Digital EcoSystems, MEDES &apos;20</title>
				<meeting>the 12th International Conference on Management of Digital EcoSystems, MEDES &apos;20<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="76" to="83" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Rheem: enabling cross-platform data processing: may the big data be with you!</title>
		<author>
			<persName><forename type="first">D</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chawla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Contreras-Rojas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Elmagarmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Idris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Kaoudi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kruse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lucas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mansour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ouzzani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Papotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-A</forename><surname>Quiané-Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Thirumuruganathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Troudi</surname></persName>
		</author>
		<idno type="DOI">10.14778/3236187.3236195</idno>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow.</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="1414" to="1427" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Apache Wayang: a unified data analytics framework</title>
		<author>
			<persName><forename type="first">K</forename><surname>Beedkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Contreras-Rojas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gavriilidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Kaoudi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Markl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pardo-Meza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-A</forename><surname>Quiané-Ruiz</surname></persName>
		</author>
		<idno type="DOI">10.1145/3631504.3631510</idno>
	</analytic>
	<monogr>
		<title level="j">SIGMOD Rec.</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="30" to="35" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">LinkedPipes ETL: evolved Linked Data preparation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Klímek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Škoda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nečaský</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Sack</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Rizzo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Steinmetz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Mladenić</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Lange</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="95" to="100" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">LinkedPipes ETL in use: practical publication and consumption of Linked Data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Klímek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Škoda</surname></persName>
		</author>
		<idno type="DOI">10.1145/3151759.3151809</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th International Conference on Information Integration and Web-Based Applications &amp; Services, iiWAS &apos;17</title>
				<meeting>the 19th International Conference on Information Integration and Web-Based Applications &amp; Services, iiWAS &apos;17<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="441" to="445" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Composable semantic data transformation pipelines with Chimera</title>
		<author>
			<persName><forename type="first">M</forename><surname>Grassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Scrocca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Carenini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Comerio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Celino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th International Workshop on Knowledge Graph Construction (KGC 2023) co-located with the 20th Extended Semantic Web Conference (ESWC 2023)</title>
				<meeting>the 4th International Workshop on Knowledge Graph Construction (KGC 2023) co-located with the 20th Extended Semantic Web Conference (ESWC 2023)</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Semantic knowledge graphs for distributed data spaces: The public procurement pilot experience</title>
		<author>
			<persName><forename type="first">C</forename><surname>Guasch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lodi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Van Dooren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web - ISWC 2022</title>
				<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="753" to="769" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">UnifiedViews: an ETL framework for sustainable RDF data processing</title>
		<author>
			<persName><forename type="first">T</forename><surname>Knap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kukhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Macháč</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Škoda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tomeš</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vojt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web: ESWC 2014 Satellite Events</title>
				<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="379" to="383" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Streaming linked data: A survey on life cycle compliance</title>
		<author>
			<persName><forename type="first">P</forename><surname>Bonte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tommasini</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.websem.2023.100785</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Web Semantics</title>
		<imprint>
			<biblScope unit="volume">77</biblScope>
			<biblScope unit="page">100785</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">ExeKGLib: knowledge graphs-empowered machine learning analytics</title>
		<author>
			<persName><forename type="first">A</forename><surname>Klironomos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gad-Elrab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Paulheim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kharlamov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web: ESWC 2023 Satellite Events</title>
				<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Nature Switzerland</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="123" to="127" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Ontologies for data science: On its application to data pipelines</title>
		<author>
			<persName><forename type="first">M.-Á</forename><surname>Sicilia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>García-Barriocanal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sánchez-Alonso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mora-Cantallops</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-J</forename><surname>Cuadrado</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Metadata and Semantic Research</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Garoufallou</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Sartori</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Siatri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Zervas</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="169" to="180" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Semantic ETL - state-of-the-art and open research challenges</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chakraborty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Padki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Bansal</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICSC.2017.94</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 11th International Conference on Semantic Computing (ICSC)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="413" to="418" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">RML: a generic language for integrated RDF mappings of heterogeneous data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dimou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vander Sande</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Colpaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Verborgh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mannens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Van De Walle</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-1184/ldow2014_paper_01.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th Workshop on Linked Data on the Web</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Heath</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Berners-Lee</surname></persName>
		</editor>
		<meeting>the 7th Workshop on Linked Data on the Web</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">1184</biblScope>
		</imprint>
	</monogr>
	<note>CEUR Workshop Proceedings</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Declarative rules for linked data generation at your fingertips!</title>
		<author>
			<persName><forename type="first">P</forename><surname>Heyvaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>De Meester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dimou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Verborgh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web: ESWC 2018 Satellite Events</title>
				<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="213" to="217" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
