<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.jss.2023.111855</article-id>
      <title-group>
        <article-title>RDF-Connect: A declarative framework for streaming and cross-environment data processing pipelines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arthur Vercruysse</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Pots</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julián Rojas Meléndez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pieter Colpaert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IDLab, Department of Electronics and Information Systems, Ghent University - imec</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>60</volume>
      <issue>2019</issue>
      <fpage>63</fpage>
      <lpage>78</lpage>
      <abstract>
        <p>Data processing pipelines are a crucial component of any data-centric system today. Machine learning, data integration, and knowledge graph publishing are examples where data processing pipelines are needed. Furthermore, most production systems require data pipelines that support continuous operation and streaming-based capabilities for low-latency computations over large volumes of data. However, the creation and maintenance of data processing pipelines are challenging, and a lot of effort is usually spent on ad-hoc scripting, which limits reusability across systems. Existing solutions are not interoperable out-of-the-box and do not allow for easy integration of different execution environments (e.g., Java, Python, JavaScript, Rust, etc.), while maintaining a streaming operation. For example, combining Python, JavaScript and Java-based libraries natively in a single pipeline is not straightforward. An interoperable and declarative mechanism could allow for continuous communication and integrated execution of data processing functions across different execution environments. We introduce RDF-Connect, a declarative framework based on semantic standards that enables instantiating pipelines with data processing functions across execution environments communicating through well-known communication protocols. We describe its architecture and demonstrate its use for an RDF knowledge graph creation, validation and publishing use case. The declarative nature of our approach facilitates reusability and maintainability of data processing pipelines. We currently support JavaScript and JVM-based environments, but we aim to extend RDF-Connect support to other rich ecosystems such as Python and to lower-level languages such as Rust, to take advantage of system-level performance gains.</p>
      </abstract>
      <kwd-group>
        <kwd>Data pipeline</kwd>
        <kwd>RDF</kwd>
        <kwd>Streaming</kwd>
        <kwd>Interoperability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Modern data-centric software systems are built on top of complex data processing operations. The
design and implementation of these operations are organized into data pipelines: sequences
of data processing components where the output of one component is the input of the next, thus
enabling smooth data flows towards a common goal [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Data pipelines are at the core of
data-dependent tasks such as data science, where pipelines are used for acquisition, curation and analysis
of data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; machine learning (ML), where pipelines support the preparation, training, validation and
cleaning of data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; knowledge graph construction and publishing, where pipelines are used to generate
semantic annotations from heterogeneous sources, validate (e.g., with SHACL) and load data into a
graph database [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Recently, the need for streaming data pipelines has increased due to the demand for low-latency
computations. Production systems often need to process high volumes of data at high velocity, rendering
traditional batch processing methods insufficient [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Batch processing systems suffer from latency
issues since they need to collect input data into batches before they can be processed any further [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Streaming data pipelines are designed to continuously process data as soon as possible, which is required
for (near) real-time analytics, monitoring and reacting to changes in data sources. Internet of Things
(IoT) scenarios are a typical example where streaming data pipelines are essential [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Although data pipelines can greatly aid data management through automation,
monitoring, fault detection, etc., modeling and implementing data pipelines often require significant
effort and expertise [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Moreover, data pipelines are often implemented as ad-hoc scripts [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] which are
error-prone and difficult to maintain and reuse across different systems [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. A myriad of data pipeline
frameworks exist, each with its own strengths and weaknesses [
        <xref ref-type="bibr" rid="ref7">11, 7, 12, 13</xref>
        ]. However, most of these
frameworks are not interoperable out-of-the-box (or only provide support for a few other specific ones)
and do not allow for easy integration of different execution environments. For instance, combining
Python, JavaScript and Java-based data processing libraries natively (i.e., each executed in their own
native environment) in a single pipeline is not straightforward. This is particularly important, for
example, in the case of RDF knowledge graph processing pipelines, where despite the maturation of RDF
libraries (e.g., Apache Jena, RDFLib, RDF-JS-based libraries, etc.), robust implementations of well-defined
operations, like RDF generation or SHACL validation, are still scarce and are only available for a few
languages. This raises the need to connect the best implementations for each operation seamlessly, and
while shell scripts can facilitate this, they are difficult to maintain and reuse.
      </p>
      <p>The need for multilingual pipelines also arises when certain critical parts of a pipeline demand
substantial resources. A pipeline should manage the required load efficiently: if a component is not
efficient, it should be replaceable with a more performant version in a different language. The challenge
lies not in handling the data across different programming languages but in connecting these languages
effectively. Conversely, not all pipeline components are critical and may adhere to different standards.
Interacting with a complex internal API to aggregate data may be more easily implemented using the
company’s primary languages, leveraging existing SDKs and expertise. In general, a pipeline should
not mandate a particular language for specific tasks.</p>
      <p>Designing an interoperable and declarative framework that allows for easy description,
continuous communication and integrated execution of data processing functions across different execution
environments is the main goal of this work. The framework should allow for a clear separation of
concerns (i.e., high-level workflow definition from step-level implementation and deployment details),
while focusing on the reusability of data pipelines [11], allowing new and similar data pipelines
to be deployed with minimal effort and adjustments. We also aim to facilitate testing,
benchmarking and comparison of different data processing tools implemented in different languages.</p>
      <p>With that goal in mind, we set out to create a declarative streaming pipeline architecture to address
the previously mentioned challenges.</p>
      <p>1. Streaming: The architecture should focus on streaming pipelines, allowing for continuous data
processing.
2. Multi-lingual: The architecture should not be built around a single language and should abstract
away language specifics.</p>
      <p>In this paper, we introduce RDF-Connect, a declarative, multilingual and streaming pipeline
framework. The framework defines a simple RDF vocabulary and architecture that abstracts languages by
using language-specific Runners, which are able to communicate over language-agnostic Channels.
These channels enable the streaming of data as messages, facilitating the construction of streaming
pipelines. The pipeline and its components (Processors) are described using SHACL shapes, which
allow declaring the expected input and output data types and constraints of each processor. To
demonstrate the fulfillment of the proposed requirements, we describe a real use case, which consists of
a pipeline that annotates, validates and loads incoming sensor data into an RDF triple store. Currently
RDF-Connect supports JavaScript and JVM-based environments. In the future, we aim to extend it to
support rich environments such as Python and lower-level languages such as Rust, to take advantage of
system-level performance gains.</p>
      <p>The remainder of the paper is structured as follows. First, we present an overview of related work,
noting the ubiquity of pipelining frameworks. Next, we describe RDF-Connect, discussing various design
decisions and the provision of multiple runners to support multilingual pipelines. The subsequent
section elaborates on an existing pipeline used in the field, addressing some of the hurdles that were
overcome. The final sections cover the conclusion and future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>In this section, we present an overview of related work that addresses the requirements, characteristics,
and challenges of data processing pipelines.</p>
      <sec id="sec-2-1">
        <title>2.1. Data pipelines</title>
        <p>
          Foidl et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] provide a comprehensive overview of data pipelines, starting with a definition where a
data processing pipeline can be understood from a theoretical perspective, as a directed acyclic graph
(DAG) composed of a sequence of nodes that process data; and from a practical perspective, as a piece
of software that automates the manipulation of data and moves them from diverse source systems
to defined destinations [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The authors also define a generic architecture for data pipelines, which
consists of three main components:
1. Data Source: The origin of the data, which can be a database, a file, a message queue, or a Web API.
2. Data Processing: The transformation of the data, which can include filtering, aggregation, enrichment, and validation.
3. Data Sink: The destination of the data, which can be other pipelines, an application or external storage systems.
        </p>
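        <p>The three-component architecture above can be sketched in a few lines of JavaScript. This is a minimal illustration only; the function names are our own and are not taken from any of the surveyed tools.</p>
```javascript
// Minimal sketch of the generic Source -> Processing -> Sink architecture
// described above; all names are illustrative and not tied to any framework.

// Data Source: yields raw records (a hard-coded array stands in for a
// database, file, message queue, or Web API).
function* source() {
  yield* [{ temp: 21 }, { temp: -999 }, { temp: 23 }];
}

// Data Processing: filtering (drop sentinel values) and enrichment.
function* transform(records) {
  for (const r of records) {
    if (r.temp === -999) continue;   // filtering
    yield { ...r, unit: "celsius" }; // enrichment
  }
}

// Data Sink: an in-memory array stands in for an external storage system.
function sink(records) {
  return Array.from(records);
}

const result = sink(transform(source()));
console.log(result);
```
        <p>Because the source and processing steps are generators, records flow through one at a time, which already hints at the streaming mode of operation discussed next.</p>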
        <p>
          Lastly, they mention a classification of data pipelines based on their processing characteristics, such
as processing mode (batch or streaming), data flow (ETL (Extract-Transform-Load) or ELT
(Extract-Load-Transform)), and use case (visualization, analysis tools, ML and deep learning applications, or
data mining) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>Similarly, Matskin et al. [11] provide a survey of big data pipeline orchestration tools. They propose
a set of criteria for evaluating pipeline tools and frameworks, emphasizing reusability, flexible pipeline
communication modes, and separation of concerns. The authors use these criteria to analyze 37 different
tools for big data pipeline orchestration. They also note that few tools support a graphical input language
for the description of pipelines, and they conclude that Apache Airflow (open-source), Argo Workflow
and Snakemake (closed-source) are the best suited frameworks for their needs, despite Airflow failing
to fulfil the separation of concerns criterion and lacking support for streaming processing [11]. Among
the criteria, the authors do not directly consider the support for multilingual pipelines; however,
step-level containerization could be seen as a way to support different languages. Thanks to its
language-oriented Runners, RDF-Connect can support native multilingual execution without resorting
to containerization, although pipeline-level containerization can be used for facilitating dependency
management.</p>
        <p>Mbata et al. [12] also provide a survey of pipeline tools for data engineering. In this survey, pipelines
are also categorised according to their characteristics, such as (i) ETL/ELT pipelines, (ii) data integration,
ingestion and transformation pipelines, (iii) orchestration and workflow management pipelines, and
(iv) ML pipelines. Apache Spark-based pipelines are highlighted as a popular choice for ETL/ELT
data processing with support for Scala, Java, Python and R, although its performance is compromised
when used for large pipelines. Apache Kafka and Apache NiFi are mentioned as streaming integration
tools. Kafka provides interfaces for multiple languages and is streaming oriented but entails a steep
learning curve. NiFi is user-friendly and supports live streaming pipelines, but is mainly Java-based. For
orchestration and workflow management, Apache Airflow and Apache Beam are highlighted. Apache
Airflow and Apache Beam primarily concentrate on running pipelines defined in a single language across
different execution environments, with a particular emphasis on big data. Although Beam supports
multilingual pipelines, its implementation is highly verbose. Beam relies on job servers to execute parts
in a different language, which have to be set up before starting the pipeline. The connections to these job
servers are complex to code and challenging to modify. On the other hand, Apache Airflow does not
support declarative pipelines, streaming data, nor multilingual pipelines (being mainly Python-oriented).</p>
        <p>
          Related works focused on streaming pipelines include the work of Isah et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] which surveys
distributed data stream processing frameworks such as Apache Flink, Apache Storm, Apache Spark
Streams, and Apache Kafka Streams. They define a taxonomy for these frameworks based on their
characteristics. For instance, the programming model (native or micro-batch) and the types of
transformations supported per record (Map, Filter and FlatMap). RDF-Connect may be considered a native data
stream processing framework with support for all types of transformations per record. The work of Dias
et al. [14] also provides a survey of stream processing frameworks with a focus on resource elasticity
(i.e., the capacity to automatically scale resources based on the workload). The authors highlight the
need for high-level abstractions to facilitate the development of stream processing applications.
        </p>
        <p>Some works that relate to workflow standards include the work of Dehury et al. [ 15], which introduces
a data pipeline architecture for serverless platforms, based on the OASIS TOSCA (Topology and
Orchestration Specification for Cloud Applications) standard 1. The authors implement the proposed
architecture using Apache NiFi components. The Common Workflow Language (CWL) is an open
standard designed to describe the execution of command-line tools and their integration into workflows2.
Its main components are command-line tools, which are essentially wrappers around commands (such
as ls, echo, tar, etc.). CWL allows for the chaining of these tools to form workflows. Workflows
themselves can also have inputs and outputs and can be nested within each other. The primary strength
of CWL lies in its ability to chain the output of one tool to the input of another. Workflows described in
CWL can be executed using different CWL runners. These range from the most basic cwltool3, which
runs workflows locally, to more advanced distributed computing platforms such as AWS and Azure via
Arvados4 and others5.</p>
        <p>Other related works include the work of Dessalk et al. [16], which proposes and implements an
approach for big data workflows based on Docker containers which communicate using the
message-oriented middleware KubeMQ. The authors also define a Domain Specific Language (DSL) for the
description of workflows, although it seems not to be used in their implementation. In principle,
this approach could be used to support multilingual pipelines; however, the authors only implement
individual data processing tasks using bash scripts. Lastly, Agrawal et al. [17] introduce RHEEM, a
cross-platform data processing framework that decouples applications from underlying platforms. This
tool allows defining data processing tasks from a limited number of mapping operators, which are then
split and assigned to different execution environments based on a cost model. Although this approach
does not qualify as a general pipeline framework, it provides an interesting approach for multilingual
and cross-platform data processing. A continuation of this work is currently being developed as an
incubating Apache project called Apache Wayang [18].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Semantic-related data pipelines</title>
        <p>Linked Pipes [19, 20] (LP) was developed to provide a more user-friendly approach to creating Linked
Open Data pipelines. Its primary objectives are to extract data, transform it into Linked Data, and
load it into RDF triple stores. The tool features a web-based GUI that facilitates the easy construction
and debugging of pipelines. Pipelines and processors in Linked Pipes are defined and configured
using RDF. Each pipeline has a dereferenceable IRI, allowing it, for example, to be imported into
another LP-ETL instance. Linked Pipes is also built on the JVM, and all processors are implemented
in JVM languages. While extending processors is possible, it demands considerable effort. Engineers
also may find it challenging to adapt the generated RDF configurations for processors. Linked Pipes
emphasizes the transportation of Linked Data between processors. It includes predefined processors
such as extractors, transformers, loaders, as well as quality assessment and special processors. Quality
Assessment processors ensure that messages are processed correctly and can halt the pipeline if issues
are detected. Special processors can execute remote commands, adding to the system’s flexibility. The
main drawbacks of Linked Pipes are its lack of support for multilingual pipelines and its batch-based
processing paradigm.</p>
        <sec id="sec-2-2-1">
          <title>Footnotes</title>
          <p>1. https://docs.oasis-open.org/tosca/TOSCA/v1.0/os/TOSCA-v1.0-os.pdf
2. https://www.commonwl.org/
3. https://github.com/common-workflow-language/cwltool
4. https://arvados.org/
5. https://www.commonwl.org/implementations/</p>
          <p>Grassi et al. [21] introduced Chimera, a semantic data transformation and integration tool based
on Apache Camel. Camel is a Java-based integration framework that provides a set of predefined
components that can be extended for custom tasks. Camel can support multilingual pipelines through
WebAssembly, which adds an additional layer of complexity. Although Camel can also support
streaming-based operations, the authors relied on batch processing for Chimera. Guash et al. [22] presented a
pipeline for RDF knowledge graph construction that uses Apache Airflow as a workflow
orchestrator together with Celery6. The authors argue in favor of using Airflow due to its scalability,
the capacity of defining tasks in code, and it being open source. Another example of a semantic framework
for pipelines is UnifiedViews [23], a batch-based ETL framework with native support for RDF data that
allows wrapping existing Java libraries as data-processing units and composing pipelines using a graphical
user interface. The open source version of UnifiedViews seems not to be maintained anymore, in favor
of a commercial version integrated into the PoolParty Semantic Suite7.</p>
          <p>Bonte et al. [24] presented a survey of stream reasoning systems and specify a lifecycle model for
Streaming Linked Data. The lifecycle model consists of seven stages:
1. Name: Identify data streams with an IRI.
2. Model: Use a data model that accounts for both data and metadata.
3. Shape: Define the smallest unit of data in a stream (e.g., triple-based or graph-based).
4. Annotate: Transform non-RDF data into RDF.
5. Describe: Include interoperable metadata for discovery purposes.
6. Serve: Define format and protocol for data sharing.
7. Query: Consume the data stream via a querying process.</p>
          <p>Although this work does not provide a specific pipeline framework, the lifecycle model can be used
to define a set of processors for a streaming pipeline. Klironomos et al. [25] introduced ExeKGLib, a
Python library to build ML pipelines described as a knowledge graph. The library is able to generate
the corresponding Python scripts from the knowledge graph description, based on a set of data science
operations (e.g., feature engineering, visualization, model training, etc.). Pipeline executions are done in
a batch-based manner and so far, there is only support for Python-based operations. Sicilia et al. [26]
introduced an ontology that attaches semantic descriptions to data and analytic transformations that
are integrated with data pipeline code. Concretely, they apply this approach to data pipelines from the
Apache Beam framework. Lastly, Chakraborty et al. [27] present a survey of ETL tools for semantic data
integration. It concludes with a set of open challenges for the field, such as the need for automation
and visualization tools.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. RDF-Connect</title>
      <p>RDF-Connect defines a declarative and streaming pipeline framework for cross-environment data
processing. It aims to enhance interoperability among data processing libraries that excel in specific
tasks. This necessity is particularly prominent within the Linked Data ecosystem due to the limited
number of robust implementations. RDF libraries are primarily written in three major programming
languages: JavaScript, Java, and Python. Developing software at a high TRL (Technology Readiness
Level) demands significant effort. To advance the RDF ecosystem, it is inefficient to reimplement the same
functionality across all programming languages. Instead, the best implementations should be able to
interoperate within a single pipeline. RDF-Connect addresses this interoperability challenge by defining
a common vocabulary to describe data processing units and their interactions, and implementing
abstraction layers per execution environment that conceal the underlying programming language
specifics of a given data processing unit.</p>
      <sec id="sec-3-1">
        <title>Footnotes: 6. Celery, a distributed Python-based task-queue system: https://github.com/celery/celery; 7. https://www.poolparty.biz/poolparty-unifiedviews</title>
        <p>For instance, consider SHACL validation as an example. The task is to determine whether a given
RDF graph is validated according to a given SHACL shape. While the SHACL specification clearly
defines the requirements for a SHACL engine implementation, providing full specification coverage
can be challenging when developing engines from scratch. Performance is another critical factor: the
validation step must not become a bottleneck in the pipeline, which is a risk if a validator is implemented
in a slower programming language. With RDF-Connect, only a single high-TRL SHACL validation
implementation is necessary, so that high performance validation becomes possible across different
pipelines.</p>
        <p>Another example of the value of RDF-Connect arises when integrating a pipeline into an
organisation’s infrastructure and some custom code is necessary. This code, being specific to the organisation, is
unlikely to be reused by others and thus can be written in a language that facilitates rapid development
and suits the developers’ expertise. RDF-Connect mitigates the tension between application speed and
development speed.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.1. General idea</title>
        <p>RDF-Connect focuses on three main objectives:
1. The choice of programming language should not impair the pipeline.
2. New components should be easy to build.
3. Pipelines should be able to be validated before running.</p>
        <p>To achieve these objectives, RDF-Connect is divided into four components:
1. Processors: These are small execution units that perform a single and usually simple task to
foster reusability.
2. Runners: Runners target a specific execution environment (like JavaScript) in which they initiate
one or more processors with the provided pipeline configuration. They also construct the
configured channels before starting the pipeline.
3. Channels: These are abstract entities responsible for data transfers among processors.
4. Pipeline Configuration: An RDF document describing various interlinked processor instances
working together to accomplish a task.</p>
        <p>[Figure 1 diagram omitted. Prefixes used: rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;; rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;; rdfc: &lt;https://w3id.org/rdf-connect#&gt;; rdfc-js: &lt;https://w3id.org/rdf-connect/js#&gt;; rdfc-jvm: &lt;https://w3id.org/rdf-connect/jvm#&gt;]</p>
        <p>Figure 1 shows the main concepts of the RDF-Connect data model, published online at https:
//w3id.org/rdf-connect. Language-specific Processors are executed by language-specific Runners, which
support specific types of streaming communication Channels. SHACL shapes8 accompany concrete
processor and channel definitions, providing descriptions for input and output parameters and their
constraints. The vocabulary can be easily extended to define coverage for additional execution
environments and communication means. Programming language barriers are overcome by splitting
logic and sending/receiving data over channels. Channels are programming language-independent
and based on well-known communication protocols, such as HTTP request/response interactions,
since most languages provide support for sending HTTP messages and starting an HTTP endpoint.
However, channel logic is abstracted from processors so that each processor acts on event-driven
messages regardless of how those messages are delivered. Each type of channel has its advantages
and disadvantages. HTTP channels are easy to set up and relatively fast but lack replay-ability and
fault tolerance. Kafka streams, for example, are another option that offers replay-ability and logging,
offering a more feature-rich type of channel to connect processors. For consecutive processors written
in the same language, in-memory channels can be used to share data efficiently. Note that this does not
compromise the first objective; all pipelines remain configurable, although some configurations may
offer some trade-offs in terms of performance. As mentioned before, processors are channel-agnostic to
ensure maximum reusability, allowing the pipeline manager to choose the optimal channel for each
step. A complete list of supported channels is described in Section 3.2.</p>
        <p>Creating a new processor is language-dependent and should be approached from the corresponding
runner’s perspective. For instance, to run a JavaScript processor, only the JavaScript file and the
processor function are required. This simplicity facilitates the creation of new processors, as each
processor merely needs to describe the file, the function and annotate the required parameters. This
information ensures consistent and reliable initiation of the processor. Each processor should be
accompanied by a SHACL shape, which runners can use to determine the processor’s parameters.
SHACL shapes enable the validation of a pipeline before it is executed. Additionally, these shapes and
processor configurations allow for the derivation of provenance information from pipelines.</p>
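        <p>The role these shapes play at validation time can be illustrated with a simplified, non-SHACL stand-in. The plain-object shape below is a hypothetical sketch, not the vocabulary RDF-Connect actually uses.</p>
```javascript
// Simplified stand-in for shape-based validation. RDF-Connect uses SHACL
// shapes for this; the plain-object "shape" below is only illustrative.
const sendShape = {
  // parameters a processor declares, as a SHACL shape would
  msg: { required: true },
  output: { required: true },
};

// Check a processor configuration against its shape before anything runs.
function validateConfig(shape, config) {
  const errors = [];
  for (const name of Object.keys(shape)) {
    if (shape[name].required) {
      if (config[name] === undefined) {
        errors.push("missing required parameter: " + name);
      }
    }
  }
  return errors;
}

// A misconfigured processor instance is rejected before execution:
console.log(validateConfig(sendShape, { msg: "Hello" }));
// -> [ 'missing required parameter: output' ]
```
        <p>With real SHACL shapes, the same check additionally covers data types and cardinality constraints, which is what makes whole-pipeline validation before execution possible.</p>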
        <p>A pipeline is runner-agnostic, which makes it possible to refer to a pipeline, without knowing the
underlying implementation, or the underlying runners. All information required for the runner to
start a processor is already present in the processor definition. Listing 1 shows a pipeline with three
processors of type rdfc-js:Send, rdfc-js:Append and rdfc-jvm:Print. The purpose of the pipeline is to send
the message "Hello" from the first processor to the second, which appends the message with "World"
and sends it to the third processor, which in turn prints the message in the console. This pipeline should
be started with the JVM and the JavaScript runner. The runners expand all owl:imports statements to
find the required processor definition files, both locally and remotely. Each runner starts all processors
that are defined against that runner. When everything is started, the application runs.</p>
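        <p>For illustration, the behaviour of this three-processor pipeline can be simulated in plain JavaScript; the channel and wiring code below is our own sketch, not the actual RDF-Connect API.</p>
```javascript
// Illustrative JavaScript simulation of the Listing 1 pipeline
// (rdfc-js:Send -> rdfc-js:Append -> rdfc-jvm:Print).

// A channel pairs a writer with a reader callback.
function channel() {
  let handler = null;
  return {
    write: (msg) => { if (handler) handler(msg); },
    onMessage: (fn) => { handler = fn; },
  };
}

const printed = [];
const sendToAppend = channel();
const appendToPrint = channel();

// Append processor: appends " World" to each message and forwards it.
sendToAppend.onMessage((msg) => appendToPrint.write(msg + " World"));

// Print processor (JVM-based in the paper, simulated here): prints messages.
appendToPrint.onMessage((msg) => {
  printed.push(msg);
  console.log(msg); // prints "Hello World"
});

// Send processor: emits the initial message into its output channel.
sendToAppend.write("Hello");
```
        <p>In the real framework, each of these processors would be started by its own runner, and the in-memory channels above could be swapped for HTTP or Kafka channels without changing the processors.</p>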
        <sec id="sec-3-3-1">
          <title>3.2. JavaScript runner, processors and channels</title>
          <p>As mentioned before, a runner is the abstraction layer between the executing processors and the
complete pipeline that constitutes the application. It handles parsing arguments, setting up channels
and providing useful feedback when the execution fails. We implemented a JavaScript runner that
can execute a JavaScript function as a processor. The code is open source and available with an MIT
license9.</p>
          <p>This was our first runner and gave us initial experience in understanding which channels were useful.
We implemented the following channels for the JavaScript runner:
• HTTP: The HTTP channel is the main channel used for crossing language barriers. The only
configuration the HTTP channel needs is a port and a host to start the endpoint. Once the
channel is set up, messages can cross programming language barriers.
• WebSocket: The WebSocket channel is very similar to the HTTP channel but might improve
throughput slightly. Note that WebSockets are bidirectional, but the reverse direction is not used,
as this would break channel-agnostic implementations.
• File: The file channel allows communication over files. Writing to this
channel writes a file, and the reader side of the channel can use a file watcher to detect changes.
A change is interpreted as a new message. This channel is particularly useful for integrating
configuration files in an idiomatic way into RDF-Connect.
• Kafka: The Kafka channel is the only channel that offers replayability and fault tolerance. If some
part of the pipeline breaks, the messages are not lost but remain on the Kafka topic created by
the channel. It also facilitates integrating a pipeline with already existing Kafka streams.
• In-memory: This channel is the main channel used to let JavaScript processors communicate
with each other. It is very fast and easy to use, as messages never leave the JavaScript heap,
avoiding data copying.
9https://github.com/rdf-connect/js-runner/tree/rdf-connect-paper</p>
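          <p>The in-memory channel's callback-based behavior can be sketched in a few lines of plain JavaScript (a minimal illustration of the idea; createInMemoryChannel and the reader/writer shapes are our own names, not the actual js-runner API):

```javascript
// Minimal sketch of an in-memory channel: messages are delivered through
// plain callbacks, so they never leave the JavaScript heap and no data is
// copied. Names here are illustrative, not the js-runner API.
function createInMemoryChannel() {
  const handlers = [];
  return {
    reader: {
      // Register a callback that is invoked for every incoming message.
      data(cb) { handlers.push(cb); },
    },
    writer: {
      // Deliver a message to all registered callbacks.
      push(msg) { handlers.forEach((cb) => cb(msg)); },
    },
  };
}

// Example: two "processors" sharing one channel.
const { reader, writer } = createInMemoryChannel();
const received = [];
reader.data((msg) => received.push(msg));
writer.push("Hello");
writer.push("World");
```

Because the reader and writer expose the same interface as any other channel, a processor written against this shape works unchanged over HTTP, WebSocket, file, or Kafka channels.
</p>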
          <p>The JavaScript runner requires all processors to denote the location of their implementation file and
function name. This gives enough information to import the file and call the function at runtime. Each
processor also comes with a SHACL shape denoting the required parameters. In previous iterations, we
used the Function Ontology (FnO) parameter mappings for this, but it was found too verbose to write
and has been deprecated (although some processor descriptions may still use it). Now each processor
gets a single argument built with RDF Lens10 derived from the SHACL shape. RDF Lens gives great
flexibility in the objects that are used as arguments. An example of a processor definition for the
JavaScript runner can be found in Listing 2; the corresponding implementation is shown in Listing 3.</p>
          <p>Listing 2: Example configuration of JavaScript processor including location, file, function name and
required argument shape.
export function resc(args) {
  args.input.data((input) =&gt; {
    console.log(args.msg, input);
  });
}</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Listing 3: Example implementation of JavaScript processor.</title>
        <p>The JavaScript runner works in three steps. First, it loads the pipeline to determine all configured
JavaScript processors and tries to extract their arguments. Then each processor is started one by one.
Each processor can return a function that will be called once all processors are started; this alleviates
race conditions and allows for particular start-up sequences, like connecting with a database. Then all
functions returned by processors are called and the pipeline starts its execution.</p>
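        <p>The three steps above can be sketched as follows (a simplified sketch of our own, not the actual js-runner code; step one, loading the pipeline and extracting arguments, is assumed to have already produced the processors array):

```javascript
// Sketch of the runner's start-up sequence: start every processor, collect
// the functions they return, and only invoke those once all processors are
// initialized, avoiding races between interdependent processors.
function run(processors) {
  const starters = [];
  // Step 2: start each processor; a processor may return a callback that
  // must only run once every processor has been set up.
  for (const { fn, args } of processors) {
    const starter = fn(args);
    if (typeof starter === "function") starters.push(starter);
  }
  // Step 3: all processors are initialized, so deferred start-up work
  // (e.g. connecting to a database) can now run safely.
  for (const starter of starters) starter();
}

// Example: "a" defers its work until "b" has initialized.
const log = [];
run([
  { fn: () => () => log.push("a-started"), args: {} },
  { fn: () => { log.push("b-init"); }, args: {} },
]);
```
</p>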
        <sec id="sec-3-4-1">
          <title>3.3. Targeting the JVM</title>
          <p>The Java ecosystem for RDF contains many mature, performant and feature-rich RDF applications and
libraries, making it a valuable target for RDF-Connect. We set out to build a runner which bridges
these applications with those targeting JavaScript, in order to greatly increase the potential for future
pipelines. We have written a proof-of-concept runner using the JVM-compatible language Kotlin11,
capable of executing not only processors written in Kotlin or Java, but essentially any language
capable of generating JVM class files, such as Scala.</p>
          <p>In contrast to the JavaScript runner, we do not implement our processors as a single function. Instead,
we have adopted a more traditional object-oriented approach, defining a common abstract class which
all processors must inherit. For example, we have delegated initialization to the extendable constructor,
and execution of the main logic to the abstract exec method. Handling state and complex logic
is more straightforward in this manner, since many additional methods and fields can be defined as
desired. Arguments are passed to the constructor as a simple string-to-object map wrapper, removing
the requirement for a specific method signature. Additionally, using Kotlin’s reified generic types, we
provide a convenient and simple API which allows for type-safe retrieval of arguments at runtime using
compiler-inferred types.
10https://github.com/ajuvercr/rdf-lens
11The code is open source and available with an MIT license https://github.com/rdf-connect/orchestrator/tree/
rdf-connect-paper
class Transparent(args: Arguments) : Processor(args) {
  private val input: Reader = arguments["input"]
  private val output: Writer = arguments["output"]

  override suspend fun exec() {
    output.push(input.read())
  }
}</p>
          <p>Listing 4: A simple Kotlin processor which reads data from an incoming channel and pushes it verbatim
to a writer.</p>
          <p>Channels are implemented much like their JavaScript counterparts, but instead of relying on promises
and callbacks, we make heavy use of the Kotlin coroutines library. Most of the communication here
is implemented using Kotlin’s channel primitives, which closely resemble those popularized by
the Go language. Interoperability with HTTP, Kafka, and so forth, is simply implemented as a
collection of data producers and consumers of those channel primitives. As a result, the actual data
source and destination are completely abstracted behind a common reader and writer interface.</p>
          <p>The design decisions above showcase how developers of new RDF-Connect runners have a large
degree of freedom implementation-wise. Despite these differences, declaring processors is still done in
a similar manner to the JavaScript runner, including SHACL shapes defining the required arguments.
Adding a processor to a pipeline is identical to the JavaScript runner, as the pipeline configuration
is runner-agnostic.</p>
          <p>In its current state, coordinating the two runners must be done manually, since one runner cannot
start or terminate another. Executing a pipeline spanning multiple runners typically requires a simple
shell script which starts each runner individually, all pointing to the same configuration file. This, as
well as other limitations and opportunities, is a key focus of our future work.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4. NiFi runner: interacting with other frameworks</title>
          <p>Apache NiFi is not a programming language, but a pipelining tool to design automated dataflow
pipelines, with a focus on the ability to operate within clusters, security using TLS encryption, and
extensibility. Apache NiFi also has an API, which can be used to list processors and build pipelines.
This is enough to build a NiFi runner. In a previous version of RDF-Connect, we built a NiFi runner that
could generate processor descriptions based on NiFi processors retrieved with the API. These processors
could be used within a pipeline just like any other processor. To instantiate those processors, the NiFi
runner made the correct API calls on the NiFi cluster. Just like the JavaScript runner has in-memory
channels, the NiFi runner also had NiFi channels as its main way of communication. Other channels,
like HTTP, were also supported by instantiating a NiFi template that worked like an HTTP endpoint
in other languages. We dropped support for NiFi in newer versions because we currently have no
demand for it within our use cases. However, this showed that such a runner is feasible, and bringing
back support for it remains possible. The original code is still available on GitHub12.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Interoperable pipeline</title>
      <p>This section covers a real-world pipeline that loads incoming sensor data into a triple store. The sensor
data originates from The Things Network13, a global collaborative Internet of Things ecosystem using
LoRaWAN®. The pipeline and its installation instructions can be found on GitHub14.
12https://github.com/ajuvercr/nifi-runner
13https://www.thethingsnetwork.org/
14https://github.com/ajuvercr/rdf-connect-paper-pipeline</p>
      <sec id="sec-4-1">
        <title>4.1. Pipeline structure</title>
        <p>A visual representation of the pipeline can be seen in Figure 2. The pipeline contains multiple processors
working together to achieve the goal. The blue hexagons represent incoming data, while the green
hexagon represents the output.</p>
        <p>[Figure 2: Overview of the pipeline. A GlobRead processor reads the YARRRML mappings and feeds
them to a YARRRML processor; the resulting RML rules, together with data arriving at the TTN POST
endpoint (webhook), feed the RML mapper processor; its output is checked by a SHACL validator
processor and finally handed to a SPARQL query processor that inserts it into a SPARQL endpoint.
The processors are connected by JavaScript in-memory channels.]</p>
        <p>The first processor creates RML [28] from YARRRML [29] rules, a more user-friendly notation for
mappings15. The processor is implemented in JavaScript and relies on the yarrrml-parser library written
in JavaScript16. The YARRRML file is sourced from a GlobRead processor, which has a single function:
reading files according to a specified glob pattern17.</p>
        <p>The next processor executes an RML mapping for each incoming JSON object received from The
Things Network (TTN). This is a JavaScript processor that wraps around the RMLMapper JAR file.
Attempts to implement this processor in Java were unsuccessful due to various difficulties using the
application code as a library18. If the GlobRead processor finds two files, the RML mapper processor
executes both mappings.</p>
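        <p>How a JavaScript processor can wrap a JAR can be sketched as follows (a hedged sketch: buildMapperCommand is a helper of our own, and we assume the rmlmapper-java CLI's -m and -o flags for the mapping and output files; the real processor also manages temporary files and streams the result onward):

```javascript
// Sketch of wrapping the RMLMapper JAR from JavaScript: for each incoming
// message the processor shells out to "java -jar rmlmapper.jar". This helper
// only assembles the command; a runner would hand its output to
// child_process.execFile(cmd.command, cmd.args, callback).
function buildMapperCommand(jarPath, mappingFile, outputFile) {
  return {
    command: "java",
    // -m names the RML mapping file, -o the output file (rmlmapper-java CLI).
    args: ["-jar", jarPath, "-m", mappingFile, "-o", outputFile],
  };
}

// Example invocation with illustrative file names.
const cmd = buildMapperCommand("rmlmapper.jar", "mapping.rml.ttl", "out.nt");
```
</p>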
        <p>The third processor checks whether the generated RDF is valid according to KWG-SHACL19. This step is
crucial, as invalid data should not contaminate the triple store. SHACL validation is implemented in multiple
languages, and with RDF-Connect it is easy to swap one implementation for another. This pipeline was
executed twice: once with a JavaScript processor wrapping around rdf-validate-shacl20, and once
with a Java processor wrapping around org.apache.jena.shacl21. The shape had to be adapted in
minor ways to allow for simple RDF generation and to account for differences between the validator processors.
For example, the observations are observed by a sensor, but only a stub for the sensor is generated
by our simple mappings. This results in invalid objects, so to demonstrate the functionality, we
altered the shape file.</p>
        <p>Since the JavaScript SHACL validator processor did not exist, we developed a new processor to
integrate into the pipeline. The processor is minimal, comprising only 25 lines of JavaScript code
and 40 lines of RDF configuration. This experience highlights how straightforward it is to create new
processors, reinforcing the practice of building small, reusable processors that each perform a single
task efficiently.</p>
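        <p>The shape of such a validator processor can be sketched as follows (argument names are illustrative, not the processor's actual configuration; the injected validate function stands in for a real library such as rdf-validate-shacl or Apache Jena's SHACL engine):

```javascript
// Sketch of a filtering processor: incoming RDF is forwarded only when the
// supplied validate function reports conformance, so invalid data never
// reaches the triple store.
function shaclFilter(args) {
  args.input.data((message) => {
    const report = args.validate(message);
    if (report.conforms) {
      args.output.push(message);
    }
  });
}

// Example with stand-in channels and a toy validator.
const accepted = [];
let deliver;
shaclFilter({
  input: { data(cb) { deliver = cb; } },
  output: { push(m) { accepted.push(m); } },
  validate: (m) => ({ conforms: !m.includes("invalid") }),
});
deliver("valid observation");
deliver("invalid observation");
```
</p>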
        <p>The final processor is again a custom processor, creating a SPARQL INSERT query from the incoming
triples, optionally placing the triples in a specific graph, and executing the SPARQL INSERT query on a
triple store using the fetch-sparql-endpoint library.
15https://rml.io/yarrrml/
16https://github.com/RMLio/yarrrml-parser/blob/development/README.md
17https://en.wikipedia.org/wiki/Glob_(programming)
18https://github.com/RMLio/rmlmapper-java
19https://github.com/KnowWhereGraph/KWG-SHACL
20https://github.com/zazuko/rdf-validate-shacl
21https://jena.apache.org/documentation/shacl/index.html</p>
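        <p>The core of that processor can be sketched as follows (a simplification of our own; buildInsertQuery is an illustrative helper, and angle brackets around the graph IRI are written as the unicode escapes "\u003C" and "\u003E" only to keep this listing XML-safe):

```javascript
// Sketch of turning a block of N-Triples into a SPARQL INSERT DATA update,
// optionally wrapping the triples in a named graph. "\u003C" and "\u003E"
// are the literal angle-bracket characters.
function buildInsertQuery(ntriples, graph) {
  const payload = graph
    ? "GRAPH \u003C" + graph + "\u003E {\n" + ntriples + "\n}"
    : ntriples;
  return "INSERT DATA {\n" + payload + "\n}";
}

// Example with a single illustrative triple and graph IRI.
const query = buildInsertQuery(
  '\u003Curn:ex:s\u003E \u003Curn:ex:p\u003E "o" .',
  "urn:ex:graph"
);
```

The resulting string would then be sent to the triple store's update endpoint, which is what the fetch-sparql-endpoint library handles in the actual processor.
</p>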
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Mapping entities</title>
        <p>Integrating the pipeline into the real world is accomplished using channels. New data is ingested via a
POST request at the beginning of the pipeline. With The Things Network, one specifies a webhook that
sends a JSON object for each measurement made by sensors. This JSON object contains information
about the node that made the observation as well as a decoded payload, which is custom for each data
logger node. An example of our payload is shown in Listing 5.
{
  "battery": 3.3299999237060547,
  "humidity": 40.12785339355469,
  "pressure": 1007.492919921875,
  "temperature": 21.81758689880371,
  "version": 2
}</p>
        <sec id="sec-4-2-1">
          <title>Listing 5: Example of the decoded payload in a TTN message</title>
          <p>Our YARRRML mapping file reflects this structure by generating a sosa:Observation for each
field in the payload separately. To test our pipeline, we intentionally made a mistake in the mapping
for humidity, creating invalid sosa:Observations. When running the pipeline, the configured triple
store will not contain any humidity observations, as these are rejected by the SHACL validator.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Swapping a processor</title>
        <p>Currently, all processors are JavaScript-based. This simplifies running the pipeline: a single command,
npx js-runner pipeline.ttl, initiates and executes the pipeline. However, if we are not satisfied
with the current SHACL validator (e.g., for performance reasons), it can easily be exchanged for an
alternative implementation, such as the Java-based processor mentioned above.</p>
        <p>This can be achieved by modifying the configuration to use a Java processor instead of the
JavaScript processor. The Java runner cannot integrate JavaScript in-memory channels, so the incoming
and outgoing channels need to be changed to, for example, HTTP channels. Both the JavaScript runner
and the Java runner will establish HTTP endpoints, allowing for message transfer across the language
barrier. The concept of the configuration remains unaffected by these changes.</p>
        <p>This example, although artificially fabricated, effectively demonstrates the versatility of RDF-Connect.
The ability to effortlessly swap out a processor to verify an implementation or replace it with a more
performant one is advantageous. Additionally, file channels exist, which read and write to a
file. This feature is invaluable for debugging the pipeline, as it allows for immediate input and output
verification by writing data to and reading it from disk.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper we introduced our ongoing work on RDF-Connect, a declarative and streaming pipeline
framework. We showed how RDF-Connect is able to abstract programming language specifics and
enable communication between data processing functions implemented in different languages over
language-agnostic channels. In this way we accomplish our goals of providing support for multi-lingual and
streaming pipelines. We also showed an example of how we are applying RDF-Connect in a real use
case, where we are able to load and validate incoming sensor data into an RDF triple store.</p>
      <p>As previously mentioned, executing pipelines which span multiple runtimes and environments is
currently not as simple as launching a single binary. More importantly, the development of runners
targeting new runtimes and languages is also limited by the high complexity of parsing the
configuration file, setting up inter-process communication channels and routing messages. Therefore, we wish
to greatly simplify RDF-Connect by designing and developing a new platform-agnostic orchestrator,
which will do much of the heavy lifting currently required of each runner individually. The orchestrator
should be responsible for starting the runners, either locally or remotely, and facilitating communication
between them. This leaves only one channel type without configuration, reducing the complexity of
pipeline configurations. Reducing the complexity of the pipeline also reduces the potential for human
error. Finally, individual runners should not need to interpret the pipeline configurations themselves.
Rather, we will map the RDF model to a simple and intuitive configuration representation at runtime in
order to relieve the individual processors of building and querying their own triple store. By doing so,
we want to make it significantly easier to bring RDF-Connect to new runtimes and environments, such
as low-level languages like Rust, to take advantage of system-level performance gains.</p>
      <p>We will also set out to define a standard way of publishing and finding processors, greatly improving
convenience and ease-of-use, to facilitate an ever-growing ecosystem of processors.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work was supported by the MAREGRAPH project (id: 101100771), which is co-funded by the
European Union under the Digital Europe Programme.</p>
      <p>[27] J. Chakraborty, A. Padki, S. K. Bansal, Semantic ETL: state-of-the-art and open research challenges,
in: 2017 IEEE 11th International Conference on Semantic Computing (ICSC), 2017, pp. 413–418.
doi:10.1109/ICSC.2017.94.
[28] A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, R. Van de Walle, RML: a generic
language for integrated RDF mappings of heterogeneous data, in: C. Bizer, T. Heath, S. Auer,
T. Berners-Lee (Eds.), Proceedings of the 7th Workshop on Linked Data on the Web, volume 1184
of CEUR Workshop Proceedings, 2014. URL: http://ceur-ws.org/Vol-1184/ldow2014_paper_01.pdf.
[29] P. Heyvaert, B. De Meester, A. Dimou, R. Verborgh, Declarative rules for linked data generation
at your fingertips!, in: The Semantic Web: ESWC 2018 Satellite Events, Springer International
Publishing, Cham, 2018, pp. 213–217.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Raj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Olsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Modelling data pipelines</article-title>
          ,
          <source>in: 2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>20</lpage>
          . doi:10.1109/SEAA51224.2020.00014.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wardat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rajan</surname>
          </string-name>
          ,
          <article-title>The art and practice of data science pipelines: A comprehensive study of data science pipelines in theory, in-the-small, and in-the-large</article-title>
          ,
          <source>in: Proceedings of the 44th International Conference on Software Engineering</source>
          , ICSE '22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>2091</fpage>
          -
          <lpage>2103</lpage>
          . doi:10.1145/3510003.3510057.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Whang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinkevich</surname>
          </string-name>
          ,
          <article-title>Data lifecycle challenges in production machine learning: A survey</article-title>
          ,
          <source>SIGMOD Rec</source>
          .
          <volume>47</volume>
          (
          <year>2018</year>
          )
          <fpage>17</fpage>
          -
          <lpage>28</lpage>
          . doi:10.1145/3299887.3299891.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>U.</given-names>
            <surname>Simsek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Angele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kärle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Opdenplatz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sommer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fensel</surname>
          </string-name>
          ,
          <article-title>Knowledge graph lifecycle: Building and maintaining knowledge graphs</article-title>
          ,
          <source>in: Proceedings of the 2nd International Workshop on Knowledge Graph Construction (KGC</source>
          <year>2021</year>
          )
          <article-title>co-located with 18th Extended Semantic Web Conference (ESWC</article-title>
          <year>2021</year>
          ),
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Rengarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <article-title>Generalizing streaming pipeline design for big data</article-title>
          , in: S. Agarwal,
          <string-name>
            <given-names>S.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          D. P. Agrawal (Eds.),
          <source>Machine Intelligence and Signal Processing</source>
          , Springer Singapore, Singapore,
          <year>2020</year>
          , pp.
          <fpage>149</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hlupić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Puniš</surname>
          </string-name>
          ,
          <article-title>An overview of current trends in data ingestion and integration</article-title>
          ,
          <source>in: 2021 44th International Convention on Information, Communication and Electronic Technology (MIPRO)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1265</fpage>
          -
          <lpage>1270</lpage>
          . doi:10.23919/MIPRO52101.2021.9597149.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Isah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Abughofa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahfuz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ajerla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zulkernine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <article-title>A survey of distributed data stream processing frameworks</article-title>
          ,
          <source>IEEE Access 7</source>
          (
          <year>2019</year>
          )
          <fpage>154300</fpage>
          -
          <lpage>154316</lpage>
          . doi:10.1109/ACCESS.2019.2946884.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shukla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Simmhan</surname>
          </string-name>
          ,
          <article-title>Benchmarking distributed stream processing platforms for iot applications</article-title>
          , in: R. Nambiar, M. Poess (Eds.),
          <source>Performance Evaluation and Benchmarking. Traditional - Big Data - Internet of Things</source>
          , Springer International Publishing, Cham,
          <year>2017</year>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L. C.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kästner</surname>
          </string-name>
          ,
          <article-title>Subtle bugs everywhere: generating documentation for data wrangling code</article-title>
          ,
          <source>in: Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering, ASE '21</source>
          , IEEE Press,
          <year>2022</year>
          , p.
          <fpage>304</fpage>
          -
          <lpage>316</lpage>
          . doi:10.1109/ASE51524.2021.9678520.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Foidl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Golendukhina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Felderer</surname>
          </string-name>
          ,
          <article-title>Data pipeline quality: Influencing factors</article-title>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>