<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Corresponding author.
$ rene.dorsch@iis.fraunhofer.de (R. Dorsch); michael.freund@iis.fraunhofer.de (M. Freund);
justus.fries@iis.fraunhofer.de (J. Fries); andreas.harth@iis.fraunhofer.de (A. Harth)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>GraphGuard: Enhancing Data Quality in Knowledge Graph Pipelines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rene Dorsch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Freund</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Justus Fries</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Harth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer IIS, Fraunhofer Institute for Integrated Circuits IIS Nordostpark 84 Nuremberg Germany 90411</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>We present GraphGuard, a data validation framework to improve the data quality of pipelines that populate knowledge graphs. The inputs for these pipelines often come from different sources, requiring various approaches for validating the data against different defects. This requirement leads to different formats for validation reports, which reduces the contextual, representational, and accessibility quality dimensions of data validation. The proposed framework consists of QualityContracts and Guardians. QualityContracts encapsulate the necessary data validation requirements in both human- and machine-readable formats. Software agents, called Guardians, use the machine-readable format to execute validation methods. We validate the practicality of our framework on a deployed data processing pipeline at a large European airport over several months of data. A comparative analysis between a basic data processing pipeline and a pipeline using our framework showed improvements in the data quality criteria of believability, interpretability, ease of understanding, consistency of representation, conciseness of representation, and accessibility.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph</kwd>
        <kwd>Data Validation</kwd>
        <kwd>Data Quality</kwd>
        <kwd>Process Optimization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Knowledge graphs (KG) are powerful tools used across many sectors to integrate data
from heterogeneous sources, making them critical for a wide range of organizations.
Industrial organizations [1, 2, 3], tech giants [4], and worldwide institutions [5] use knowledge
graphs as a fundamental component of their data-driven strategies for tasks such as data
aggregation, data analysis, process optimization, and decision-making. Given this extensive
usage, ensuring high data quality within these knowledge graphs is crucial.</p>
      <p>To incorporate data into knowledge graphs, data integration pipelines are frequently used
[6]. These pipelines consist of multiple loosely coupled components, each executing different
tasks to process and transform data. Validation techniques based on the Resource Description
Framework (RDF), such as SHACL [7] and SHEX [8], are frequently employed to maintain
accuracy in knowledge graphs. However, it is crucial to address data defects, e.g., unavailable data
points or outliers, at their source [9]. Addressing a wide range of data defects requires diverse
validation methods, which can differ in the type of data they can validate,
the rules they employ, the procedures they follow, and the presentation of validation results. As
a result, various methods tailored to diverse data types and defect types, along with strategies
like functional redundancy [10], are used to validate data across a single processing pipeline.
These validation methods produce validation reports based on the inspected
data. Given the variety of methods used for validation, there is a trade-off between ensuring
high accuracy of data and maintaining homogeneity in validation reports. When high accuracy
is assured, the validation reports become more heterogeneous, which reduces the contextual,
representational, and accessibility criteria of data quality [11]. Lowering these data quality
criteria reduces the usability of the validation reports for use cases such as quality assessment
[12], assurance [13], control [14], and monitoring [15, 16], data auditing and compliance<sup>1</sup>, and
data analytics [17, 18]. Previous research is limited by its design: these works tend to focus
on reducing heterogeneity by restricting elements of data validation, such as the type of data
being validated, the validation procedures employed, and the rules governing data validation
[19], and overlook the potential of using diverse validation methods in a compatible manner.
For instance, some validation services might require adhering to a specific format [19], may be
proprietary [9], or may be limited to a particular data format [7, 8, 20, 21].</p>
      <p>Therefore, we investigate the research question, "How can the data quality of heterogeneous
data validation methods be improved in the context of data processing pipelines for knowledge
graphs?"</p>
      <p>Concretely, we introduce a framework that defines validation rules as reusable components.
These components are available to a software agent that implements the necessary software
to run the validation rules and generate validation reports. This report is available in the
processing pipeline and within a knowledge graph, which contains machine and human-readable
descriptions of validation rules and the reports in RDF. Our approach is capable of executing
the rules established in various validation frameworks while presenting a comprehensive,
homogeneous report of the validation procedure.</p>
      <p>The main contributions of our work are two-fold. First, we formulate an ontology to represent
validation rules, their execution process, and the capabilities of a software agent to execute
them. Second, we conduct a comparative evaluation of our framework in an actual real-world
scenario.</p>
      <p>We present related work and the derived main requirements in section 2. Section 3 introduces
our framework for handling heterogeneous data validation reports. We then describe a use case
from a European airport in section 4. The use case forms the basis for the evaluation in section
5. Finally, in section 6, we provide a summary of our proposal, current limitations, and future
work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <p>Validation is defined by the European Statistical System (ESS) [19] as a "... process which
ensures the correspondence of the [...] data with a number of quality characteristics." Hence,
here we present related research on comprehensive data validation frameworks; frameworks
to describe rules for quality characteristics; and the derivation of executable processes from
knowledge graphs. Based on the related work, we establish various requirements to design a
framework that is capable of handling various validation methods. (Footnote 1: cf.
https://cros-legacy.ec.europa.eu/content/overview-data-and-metadata-exchange-ess_en)</p>
      <sec id="sec-2-1">
        <title>2.1. Related Work</title>
        <p>The data validation framework of the ESS [19] validates statistical data from different European
countries to generate EU-wide reports. The validation system requires a domain-specific
language, called Validation and Transformation Language, that was designed to validate the
structure and content of datasets. The reports from the ESS are described in a machine- and
human-readable format to improve transparency and reuse of the reports. Ericson [9] presented
their validation framework for machine learning-enabled software systems. Similar to the
validation system of the ESS, it is able to validate the structure (e.g., size of the dataset) and
content (e.g., missing values, range of values, malformed values, outliers). To handle the different
required validation methods, they developed a proprietary validation library. In summary, both
frameworks are able to validate datasets on their structure and content but require the migration
to a specific validation system.</p>
        <p>Cross-organizational validation methods primarily rely on open protocols defined by contracts
or specifications. Examples include the frictionless-schema [21], which validates tabular data; the
json-schema [20] for JSON-serialized data; and SHACL [7] or SHEX [8] for RDF-described data.
However, these methods predominantly focus on schema or pattern validation, necessitating
additional frameworks to validate other data characteristics, such as outliers or duplicates.</p>
        <p>Zheng et al. [22] employ knowledge graphs to depict and operationalize machine-learning
pipelines in RDF format. These pipelines are made executable using a specialized library, which
integrates machine-executable semantics into the code. However, a notable limitation of this
method is its reliance on a library, which hides the underlying rules and algorithms. Similar to
the preceding methods, this approach raises questions about transparency and modifiability,
specifically what the algorithms execute, how the rules can be accessed, and how new methods
for processing data can be integrated.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Requirements</title>
        <p>To derive data quality requirements for data validation, we used the framework of data quality
introduced by Wang et al. [11] as it is well-established, provides detailed characterizations of
data quality dimensions, and can be used for data quality assessment [23] and improvement
[24]. Wang et al. identified 16 criteria to characterize data quality and grouped them into four
dimensions: intrinsic, contextual, representational, and accessibility. Based on these dimensions, we
derived the following requirements for handling heterogeneous data validation. The intrinsic
quality dimension covers the extent to which data values correspond to reality. To maintain a
high level of intrinsic data quality, any validation method must be permitted, so that the highest
accuracy can be achieved in knowledge graphs [9, 10, 25]. The accessibility quality dimension is concerned with
the extent to which data is available or obtainable. To enhance accessibility, it
is necessary to report validation rules and the outcomes of such techniques separately from
their implementation. This ensures that users can quickly and easily retrieve information about
the implemented techniques in the pipeline. In addition, knowledge graphs ease the connection
of data with its validation reports. Therefore, the outcomes of the validation methods should be
integrable into the knowledge graph. The representational quality dimension focuses on the extent
to which data can be represented in an understandable and transparent manner. This means
that reports should be concise to accommodate the varying needs of different users. Additionally,
they should be presented in a human- and machine-readable format [26], enabling analysis based
on the reports for different use cases [25]. The contextual quality dimension focuses on the
extent to which data is applicable or relevant to the tasks of a data consumer. To provide
higher contextual data quality, validation constraints should be quickly accessible, reusable, and
modifiable [9].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. GraphGuard Framework</title>
      <p>Based on the given requirements, we developed GraphGuard, a framework to support
heterogeneous data validation. In this section, we present the two main components, i.e., QualityContracts
and Guardians. QualityContracts are formal specifications that define a set of constraints for
acceptable data. Guardians are software agents responsible for enforcing these constraints by
validating data against QualityContracts. Based on the validation result, Guardians generate a
QualityValidationReport.</p>
      <p>The general data model for the developed ontology<sup>2</sup> is provided in figure 1. We reused existing
ontologies to promote interoperability and to build upon established standards. We used the
Data Quality Vocabulary (DQV)<sup>3</sup> to express data quality metadata, the Open Digital Rights Language
Ontology (ODRL)<sup>4</sup> and the Profiles Vocabulary (PROF)<sup>5</sup> to express the contracts and constraints, the
Data Catalog Vocabulary (DCAT)<sup>6</sup> and the Software Package Data Exchange Ontology (SPDX)<sup>7</sup>
to describe data exchange, and the Provenance Ontology (PROV)<sup>8</sup> for provenance information.</p>
      <sec id="sec-3-1">
        <title>3.1. QualityContracts</title>
        <p>A QualityContract is an RDF document that embeds descriptions of the general purpose of
the validation procedure and metadata about the contract usage for a specific dataset. To improve
readability, a QualityContract always contains a human- and machine-readable description of
its content. The human-readable description contains an association with the author and the
time when the contract was generated. The human-readable descriptions in QualityContracts
enhance the understandability of the contracts for data stewards and domain experts, while
machine-readable formats allow for automated processing by software agents like Guardians.
(Footnotes: 2 http://www.purl.org/graphguard/ontology; 3 https://www.w3.org/TR/vocab-dqv/;
4 https://www.w3.org/TR/odrl-model/; 5 https://www.w3.org/TR/dx-prof/;
6 https://www.w3.org/TR/vocab-dcat-2/; 7 https://www.spdx.org/rdf/terms/;
8 https://www.w3.org/TR/prov-o/)</p>
        <p>[Figure 1: Data model of the GraphGuard ontology. Legible classes in the extracted figure:
val:QualityContract (a dqv:QualityPolicy) with prov:generatedAtTime (1): xsd:dateTime,
prov:generatedBy (1): prov:Entity, dcat:identifier (1..*): rdfs:Literal, and rdfs:comment (1..*): xsd:string;
dcat:Dataset (an odrl:Asset) with dcat:format (1): dcat:MediaTypeOrExtent,
dcat:downloadURL (0..*): rdfs:Resource, and dcat:identifier (1..*): rdfs:Literal;
val:Guardian (a prov:SoftwareAgent) with rdfs:label (1..*): xsd:string,
dct:conformsTo (1..*): dct:Standard, and dct:format (1..*): dct:MediaTypeOrExtent.
Legible relations: prof:isProfileOf, dcterms:conformsTo, odrl:target, prov:used,
prov:wasAssociatedWith, and dqv:computedOn. The remaining classes and cardinalities are not
recoverable from the extracted text.]</p>
        <p>QualityContracts have at least one QualityConstraint for a single dataset describing validation
constraints on the data. QualityConstraints are specialized prof:Profiles designed to
specify constraints of QualityContracts in a human- and machine-readable format. Violations
of these constraints can be defined with different levels of severity, for instance the three levels
"info", "warning", and "error". The "info" level describes potential defects
that do not affect the data validation any further. The "warning" level provides information
about potential issues identified by the QualityConstraint. The "error" level signifies a violation
of a validation rule that must be fulfilled: if data does not fulfill the constraint, a report is
sent to a knowledge graph and the validation process stops. Constraints that are not enforced
still recognize the defect in the data and report it to the graph, but do not enforce
handling the cause. This reporting procedure enables traceability of defects in the data without
interrupting the processing pipeline.</p>
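        <p>The severity handling described above can be illustrated with a minimal Python sketch. It is not
taken from the GraphGuard library: the names Severity, Violation, ValidationInterrupted, and
handle_violations are hypothetical, and a plain list stands in for the knowledge graph that
receives the reports.</p>
        <preformat>
```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"


@dataclass
class Violation:
    constraint: str      # identifier of the violated QualityConstraint (hypothetical)
    severity: Severity
    message: str


class ValidationInterrupted(Exception):
    """Raised when an enforced ("error") constraint is violated."""


def handle_violations(violations, report_sink):
    """Report every violation, then stop if any of them is enforced."""
    for violation in violations:
        # Stands in for sending a report to the knowledge graph.
        report_sink.append(violation)
    enforced = [v for v in violations if v.severity is Severity.ERROR]
    if enforced:
        raise ValidationInterrupted(enforced[0].message)


reports = []
# A non-enforced violation is reported, and processing continues.
handle_violations([Violation("c1", Severity.WARNING, "value near sensor limit")], reports)
# An enforced violation is reported, and processing is interrupted.
try:
    handle_violations([Violation("c2", Severity.ERROR, "MAC address missing")], reports)
except ValidationInterrupted as exc:
    print(f"pipeline stopped: {exc}")
```
        </preformat>
        <p>In this sketch, both violations end up in the report sink, but only the "error" violation
interrupts the caller.</p>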
        <p>QualityValidationResources are defined for the execution of QualityConstraints.
QualityValidationResources describe machine-readable and executable validation rules. We separated
QualityConstraints from specific resources to provide a modular approach to data validation.
This separation ensures that QualityContracts remain flexible and can easily accommodate
different types of validation rules without overhauling the entire constraint. For instance, a
QualityContract can be defined for a dataset of measurements from different sensors, where different
QualityConstraints are defined for different types of defects. A QualityConstraint can be defined
to validate the data from the dataset against outliers. Multiple QualityValidationResources
implement different methods, e.g., time-based, depth-based, or distance-based methods, that
detect different types of outliers and ensure higher accuracy in the data. Hence, our approach
enables an easy extension of existing QualityConstraints by further QualityValidationResources.</p>
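        <p>As an illustration of one QualityConstraint backed by several QualityValidationResources, the
following Python sketch attaches two interchangeable outlier detectors to a single constraint. The
function names, the thresholds, and the median-based second method are assumptions made for this
example; only the distance-based idea is named in the text.</p>
        <preformat>
```python
from statistics import median


def distance_based_outliers(values, radius, min_neighbors):
    """Flag values with fewer than min_neighbors other values within radius."""
    flagged = []
    for i, v in enumerate(values):
        neighbors = sum(
            1 for j, w in enumerate(values) if j != i and radius >= abs(v - w)
        )
        if min_neighbors > neighbors:
            flagged.append(v)
    return flagged


def median_deviation_outliers(values, max_deviation):
    """Flag values that lie further than max_deviation from the median."""
    center = median(values)
    return [v for v in values if abs(v - center) > max_deviation]


# One constraint ("no outliers") with several resources; adding a further
# method only means appending another callable to the list.
outlier_constraint = {
    "comment": "Sensor readings must not contain outliers.",
    "resources": [
        lambda vs: distance_based_outliers(vs, radius=1.0, min_neighbors=1),
        lambda vs: median_deviation_outliers(vs, max_deviation=5.0),
    ],
}

readings = [20.1, 20.4, 19.8, 20.0, 35.2]
found = sorted(
    {v for resource in outlier_constraint["resources"] for v in resource(readings)}
)
print(found)
```
        </preformat>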
        <p>The QualityValidationResources are essentially the functional components of a
QualityContract. The resources implement technical or domain-specific validation rules to evaluate the
results of a pipeline component. Technical validation rules describe assumptions about data
types and meta-properties of a dataset, such as the existence or uniqueness of values.
Domain-specific validation rules focus on rules related to a specific domain of interest. For instance,
expert knowledge about the processed data can be expressed with such rules.</p>
        <p>QualityContract (RDF in Turtle syntax):</p>
        <preformat>[] a val:QualityContract ;
    odrl:target [
        a dcat:Dataset ;
        dcat:identifier "my_sensor_data.csv" ;
        dcat:format "text/csv" ] ;
    prov:generatedAtTime "2023-08-14T07:51:28.607114" ;
    prov:wasGeneratedBy :me ;
    prof:isProfileOf &lt;my_constraint&gt; .

&lt;my_constraint&gt; a val:QualityConstraint ;
    rdfs:comment "The data must always have a MAC-Address, a datatype, and a single datum in each row." ;
    prof:hasResource [
        a val:QualityConstraintResource ;
        prof:hasArtifact &lt;my_sensor_constraints.yml&gt; ;
        dcat:format "application/yaml" ;
        dct:conformsTo &lt;https://specs.frictionlessdata.io/data-package/&gt; ] .</preformat>
        <p>Artifact my_sensor_constraints.yml (frictionless-schema in YAML):</p>
        <preformat>fields:
  - name: MAC
    type: string
    format: default
    constraints:
      pattern: "^[0-9a-f]{2}(:[0-9a-f]{2}){5}$"
  - name: value
    type: number
    constraints:
      required: true
  - name: type
    type: string
    constraints:
      enum: ["Temperature", "Pressure", "Humidity"]</preformat>
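        <p>To make the effect of these field constraints concrete, the following Python sketch checks
single rows against the pattern, required, and enum rules of the example artifact. It is a
simplified stand-in written with the standard library only, not the frictionless validator that a
Guardian would use; the names SCHEMA and check_row are hypothetical.</p>
        <preformat>
```python
import re

# Field rules copied from the example artifact; the dictionary layout is an
# assumption made for this sketch.
SCHEMA = {
    "MAC": {"pattern": r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$"},
    "value": {"required": True},
    "type": {"enum": ["Temperature", "Pressure", "Humidity"]},
}


def check_row(row, schema=SCHEMA):
    """Return a list of human-readable violations for one CSV row (a dict)."""
    errors = []
    for field, rules in schema.items():
        cell = row.get(field, "")
        if rules.get("required") and cell == "":
            errors.append(f"{field}: value is required")
        if "pattern" in rules and cell and not re.fullmatch(rules["pattern"], cell):
            errors.append(f"{field}: '{cell}' does not match the pattern")
        if "enum" in rules and cell and cell not in rules["enum"]:
            errors.append(f"{field}: '{cell}' is not an allowed value")
    return errors


good = {"MAC": "00:1a:2b:3c:4d:5e", "value": "21.5", "type": "Temperature"}
bad = {"MAC": "00:1A:2B", "value": "", "type": "Speed"}
print(check_row(good))
for error in check_row(bad):
    print(error)
```
        </preformat>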
        <p>To allow software agents to interpret and execute QualityValidationResources,
each resource requires an Artifact (prof:hasArtifact) that refers to machine-executable
code. An Artifact can be any validation algorithm that implements the QualityValidationResource. To
ensure the integrity of the Artifact, the QualityValidationResource provides a Checksum. The
Guardian interprets the Artifact based on the specified format (dct:format) and the required
environment (dct:conformsTo), which is defined as a standard. Figure 2 illustrates an example
of a QualityContract for a dataset that contains tabular data. It shows that a dataset called
"my_sensor_data.csv" with the media type "text/csv" needs to fulfill a QualityConstraint. The
QualityConstraint restricts the format of the MAC address and the type of
sensor information provided. To validate the data from "my_sensor_data.csv", the artifact
"my_sensor_constraints.yml" will be used by a Guardian that can interpret
frictionless-schema specifications (section 3.2) that are defined in the "application/yaml" format.</p>
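        <p>The integrity check on an Artifact can be sketched as follows. The text does not state which
checksum algorithm is used, so SHA-256 is an assumption here, and artifact_is_intact is a
hypothetical helper rather than a function of the GraphGuard library.</p>
        <preformat>
```python
import hashlib
import hmac


def artifact_is_intact(artifact_bytes, expected_hex):
    """Compare the SHA-256 digest of an artifact with its declared checksum."""
    actual = hashlib.sha256(artifact_bytes).hexdigest()
    # compare_digest avoids leaking where the two digests first differ.
    return hmac.compare_digest(actual, expected_hex.lower())


# Example artifact content; in practice this would be read from the file
# referenced by prof:hasArtifact.
artifact = b"fields:\n  - name: MAC\n    type: string\n"
declared = hashlib.sha256(artifact).hexdigest()

print(artifact_is_intact(artifact, declared))
print(artifact_is_intact(artifact + b"# tampered", declared))
```
        </preformat>
        <p>A Guardian would refuse to load an Artifact for which this check fails.</p>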
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Guardians</title>
        <p>Guardians are prov:SoftwareAgents that enforce QualityContracts by validating data
in data processing pipelines. Their primary role in the pipeline is to act as gatekeepers,
ensuring that only data conforming to the QualityContracts gets processed. Figure 3 illustrates
the execution sequence of a Guardian for a dataset that will be validated and the resulting
QualityValidationReport.</p>
        <p>At the start of data processing, the data is loaded and the Guardian is initialized. The
initialization process involves querying contracts from a designated knowledge graph that contains the
QualityContracts for the particular dataset. QualityConstraints and QualityValidationResources
are obtained from the QualityContracts. The integrity of the QualityValidationResources is
verified using the supplied Checksum, and the resources are compared against the standards that the Guardian
can conform to (dct:conformsTo). In the presented example, the Guardian can conform to
QualityConstraintResources developed as frictionless-schemas or Python code that does not
require any specific libraries. In the final phase of the initialization, the Guardian loads the
Artifacts of QualityValidationResources to enable validation.</p>
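        <p>The conformance step of the initialization can be sketched as a lookup against the Guardian's
capabilities. The Resource class, the SUPPORTED table, the identifier used for pure Python code,
and the SHACL example resource are assumptions for illustration; only the frictionless-schema
entry mirrors the example contract.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    artifact: str      # file referenced via prof:hasArtifact
    media_type: str    # value of dct:format
    conforms_to: str   # value of dct:conformsTo


# Hypothetical capability table of one Guardian: (format, standard) pairs.
SUPPORTED = {
    ("application/yaml", "https://specs.frictionlessdata.io/data-package/"),
    ("text/x-python", "urn:example:pure-python"),
}


def select_loadable(resources, supported=SUPPORTED):
    """Split resources into those the Guardian can interpret and the rest."""
    loadable, rejected = [], []
    for resource in resources:
        key = (resource.media_type, resource.conforms_to)
        (loadable if key in supported else rejected).append(resource)
    return loadable, rejected


resources = [
    Resource("my_sensor_constraints.yml", "application/yaml",
             "https://specs.frictionlessdata.io/data-package/"),
    Resource("shapes.ttl", "text/turtle", "https://www.w3.org/TR/shacl/"),
]
loadable, rejected = select_loadable(resources)
print([r.artifact for r in loadable])
print([r.artifact for r in rejected])
```
        </preformat>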
        <p>[Figure 3: Sequence diagram of a Guardian's execution, with the participants ProcessingComponent
and Guardian. Initialization: Initialize Guardian, Load Data, Query QualityContract Information,
Return QualityContract Details, Validate Integrity &amp; Conformance of Resources, Load
QualityConstraintResources, Acknowledge Initialization. Validation: Validate Data, Send
QualityReport, Acknowledge Receipt.]</p>
        <p>The data is validated against rules (e.g., according to figure 2) induced by the
QualityValidationResources, which produces a QualityValidationReport (e.g., according to figure 3). The
QualityValidationReport presents the outcome in a format that is understandable to both machines
(dqv:result) and humans (val:report). Furthermore, it provides insight into the
QualityValidation process that was conducted. The machine-readable result can either be true or false,
representing a successful or failed validation of the data. To enable tracking of validation
activities and to document the results, the QualityValidationReport is sent to the knowledge
graph. With this information, an assessment can be made regarding the effectiveness and
compliance of validation methods. If the results indicate the need, a notification system may
be activated to inform users about the current status of the validation step. After successful
validation, the Guardian reports its results back to the pipeline. If the validation fails, the result
is passed to the processing pipeline, allowing the handling of the exception.</p>
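        <p>The reporting flow can be sketched in Python as follows. The keys "dqv:result" and
"val:report" follow the property names used in the text; all other keys, the function names, the
example rule, and the list that stands in for the knowledge graph are assumptions for this
illustration.</p>
        <preformat>
```python
from datetime import datetime, timezone


def build_report(constraint, guardian, violations):
    """Create a report with a machine-readable and a human-readable outcome."""
    passed = not violations
    summary = "all checks passed" if passed else "; ".join(violations)
    return {
        "dqv:result": passed,      # machine-readable: True or False
        "val:report": summary,     # human-readable description
        "constraint": constraint,  # hypothetical key
        "guardian": guardian,      # hypothetical key
        "validated_at": datetime.now(timezone.utc).isoformat(),
    }


def validate_step(rows, constraint, guardian, knowledge_graph):
    """Validate rows, publish the report, then hand the result to the pipeline."""
    violations = [
        f"row {i}: MAC is missing" for i, row in enumerate(rows) if not row.get("MAC")
    ]
    report = build_report(constraint, guardian, violations)
    knowledge_graph.append(report)  # published before the pipeline continues
    return report


kg = []
rows = [{"MAC": "00:1a:2b:3c:4d:5e"}, {"MAC": ""}]
report = validate_step(rows, "my_constraint", "guardian-1", kg)
if not report["dqv:result"]:
    # The pipeline decides how to handle the failed validation.
    print(f"validation failed: {report['val:report']}")
```
        </preformat>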
        <p>As a proof of concept, we implemented the Guardian in Python as a reusable library<sup>9</sup>. The
provided library enables Guardians to automatically query a knowledge graph for
QualityContracts for a given dataset; interpret QualityValidationResources for an implemented standard
(currently pure Python code, data provided from the frictionless schema, and SHACL rules);
validate data based on the provided validation rules and methods; and send validation reports
to a specified knowledge graph.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Case Study</title>
      <p>We implemented the framework at Munich Airport, which handles an average of 80,000
passengers daily. The airport employs the knowledge graph for various applications, including
generating insights about the performance of the baggage handling system, analyzing baggage
traces with process mining, and optimizing the luggage handling process with data-driven value
stream analysis. The deployment of the knowledge graph pipeline at the airport is shown in
figure 4. Various cleaning (step 1), processing (step 2), transformation (step 3), and mapping
rules (step 4) are applied to the source data on a daily basis. Data is automatically extracted from
the graph for further analysis and enriched with additional information (step 5). The data is
validated at each step, and the validation reports are incorporated into the knowledge graph to
improve transparency. Implemented validation methods are used to validate data, for instance
against the following defects. (Footnote 9: https://github.com/wintechis/guardian)</p>
      <p>• Duplicates: The baggage identifiers used by each airline repeat over a specific
range of numbers. This requires recognizing and handling different pieces of baggage
with the same identifier (in step 1).</p>
      <p>• Corrupted data: Data may also be compromised by defects introduced during check-in.
During check-in, employees of the airline can add information in a free text field. This
can cause corruption during serialization (step 1), which affects the processing of the
data.</p>
      <p>• Unavailable data attributes: The baggage handling process relies heavily on messages
from airlines. These are standardized messages [27] that include many optional
fields in different formats. The data needs to be validated before processing (in step 3 and
step 5) to prevent the generation of URIs for non-existent information during mapping.</p>
      <p>• Contextualized data: The standardized messages include information that can only be
understood within the context of multiple messages (e.g., the last 5 messages). To prevent
defects, data needs to be validated (in step 2) such that some information cannot exist
twice in the same context.</p>
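      <p>The duplicate defect can be illustrated with a small Python sketch that separates pieces of
baggage sharing one identifier by the time gap between their messages. The tag format, the
timestamps, the two-day threshold, and the function name are hypothetical; the rule deployed at
the airport is not described in this paper.</p>
      <preformat>
```python
from datetime import datetime, timedelta


def split_reused_identifiers(messages, max_gap):
    """Assign a per-bag key so that a reused identifier yields distinct bags.

    messages is an iterable of (tag, seen_at) pairs. A tag seen again after
    more than max_gap is treated as a new piece of baggage.
    """
    last_seen = {}
    generation = {}
    keyed = []
    for tag, seen_at in sorted(messages, key=lambda m: m[1]):
        if tag in last_seen and seen_at - last_seen[tag] > max_gap:
            generation[tag] += 1
        generation.setdefault(tag, 0)
        last_seen[tag] = seen_at
        keyed.append((f"{tag}-{generation[tag]}", seen_at))
    return keyed


messages = [
    ("0220123456", datetime(2023, 8, 1, 8, 0)),
    ("0220123456", datetime(2023, 8, 1, 9, 30)),
    ("0220123456", datetime(2023, 8, 20, 7, 15)),
]
keys = [key for key, _ in split_reused_identifiers(messages, timedelta(days=2))]
print(keys)
```
      </preformat>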
      <p>We have implemented our framework on top of the deployed processing pipeline for the
knowledge graph. The framework replaces previous validation procedures by introducing
Guardians that perform the same validation procedures. The validation results are stored in
the same knowledge graph as the integrated data, which allows faster access to the validation
reports. This implementation has improved the transparency of the processing pipeline. It
allows data analysts to familiarise themselves with the data and its processing, build greater
confidence in the data, and obtain a clear and traceable perspective on the results of validation
procedures for different data sources.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>In this section, we evaluate the framework according to the quality dimensions outlined in
section 2. First, we present the evaluation framework and results. Second, we provide a detailed
comparison of the different dimensions of data quality.</p>
      <sec id="sec-5-1">
        <title>5.1. Comparative Analysis</title>
        <p>We evaluated our framework based on the Data Quality Dimensions [11] - accessibility, intrinsic,
contextual, and representational. For each of the dimensions, Wang et al. provided further
criteria to discuss the data quality of systems. Data in the context of our framework is a triple
consisting of the used Dataset, the QualityContracts, and the QualityValidationReports. Based on
these criteria, table 1 illustrates the results of comparing two versions of the pipeline presented
in section 4: one without the data validation framework (baseline) and one where the framework
has been applied. Evaluation results are expressed through arrows, which represent changes in
comparison to the baseline. Results are excluded for quality criteria when the framework does
not directly affect the metric (represented as ∅).</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Intrinsic Data Quality</title>
        <p>Intrinsic Data Quality focuses on the extent to which data values are in conformance with
reality. The baseline implementation evaluated the validation rules as an embedded part of
the pipeline, and its results were only used for the processing. In comparison to the baseline,
our framework improves Believability - the extent to which data is accepted or regarded as
true or credible - by enabling users to query and retrieve QualityContracts that have been
used for validation of the datasets. Furthermore, it allows the users to retrieve the concrete
QualityConstraints with their implementation, thus improving the credibility of the used data.
Accuracy is the extent to which data is correct and certified error-free. Unlike the baseline
approach, our framework separates the application logic of the pipeline from the validation
rule. The validation rule is evaluated by Guardians using QualityContractResources for the
computation. The Guardians use the same libraries and tools to interpret the validation
rules as the baseline implementation. Therefore, no change in the accuracy can be observed.
Objectivity is the extent to which the data is unbiased and impartial. Our framework does not
directly affect the objectivity of the dataset that needs to be validated, but it does affect the objectivity of the
validation rules. The validation rules are separated from the implementation in the pipeline,
reducing the potential for developing ad-hoc rules. Therefore, the framework provides means to
develop the processing pipeline and the validation rules separately, which may improve the
overall objectivity. Similar to Objectivity, Reputation - the extent to which data is trusted
or highly regarded in terms of its source or content - is not affected directly. However, the
validation rules can be accessed more easily through the knowledge graph providing them, and
quality reports are provided, building trust in the content of the data. With this information,
data can be assessed more transparently, which enhances its overall reputation.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Contextual Data Quality</title>
        <p>Contextual Data Quality relates to the extent to which data applies to a data user’s tasks. The
Value-added criterion refers to the extent to which users gain an advantage from the provided
data. Relevancy measures the extent to which data is applicable and helpful for the application.
Completeness defines the extent to which the breadth, depth, and scope of the data are sufficient.
Despite their importance, these criteria are outside the scope of our proposal, as the three quality
criteria depend mainly on the provided use case. Nevertheless, we believe that applying
our framework can aid in evaluating the quality of the knowledge graph pipeline for adopting
new use cases. The framework offers easy-to-access information about existing data sources,
characterizes the data, and offers reusable validation methods. This approach not only makes
evaluations and assessments of the pipeline’s data relevancy and completeness easier but also
supports the creation of new knowledge graph population pipelines.</p>
        <p>Timeliness measures how quickly new data points become available to a user. Our approach requires
additional time for the validation process, because the QualityValidationReport is published
on the knowledge graph before the result is delivered to the processing element. As a result,
timeliness improves from the end user’s perspective, while it decreases
for the processing component. Appropriate amount of data is defined as
the extent to which the quantity of available data is appropriate. Our framework generates
a QualityValidationReport for each dataset and each QualityConstraint. Similarly, the baseline
produces a result for each validation rule. However, our framework makes more data available
about the validation: it describes not only the result but also the validation time, the involved
QualityConstraint, and the involved software agent.</p>
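        <p>As an illustration of the paragraph above, the following minimal Python sketch models one QualityValidationReport per dataset and QualityConstraint, carrying the result, the validation time, the involved QualityConstraint, and the involved software agent. The class layout, the field names, the build_reports helper, and the urn:example: identifiers are assumptions made for illustration only; they are not the vocabulary of the framework.</p>
        <preformat>
```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class QualityValidationReport:
    """Hypothetical record; field names are illustrative assumptions."""
    dataset: str             # identifier of the validated dataset
    quality_constraint: str  # identifier of the involved QualityConstraint
    guardian: str            # identifier of the involved software agent
    conforms: bool           # validation result
    validated_at: str        # validation time as an ISO 8601 timestamp


def build_reports(dataset, constraints, guardian, check):
    """Create one report per QualityConstraint for a single dataset.

    `check` is a caller-supplied function (dataset, constraint) -> bool
    that stands in for an actual validation method.
    """
    now = datetime.now(timezone.utc).isoformat()
    return [
        QualityValidationReport(dataset, c, guardian, bool(check(dataset, c)), now)
        for c in constraints
    ]


# Usage with made-up identifiers and a dummy check.
reports = build_reports(
    dataset="urn:example:dataset:1",
    constraints=["urn:example:constraint:schema", "urn:example:constraint:range"],
    guardian="urn:example:guardian:1",
    check=lambda dataset, constraint: constraint.endswith("schema"),
)
for report in reports:
    print(asdict(report))
```
</preformat>
        <p>The sketch only mirrors which pieces of information accompany each result; in the framework, the reports are published on the knowledge graph.</p>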
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Representational Data Quality</title>
        <p>Representational Data Quality focuses on the extent to which data is presented in a way that is
understandable and clear for the data user’s task. It incorporates Interpretability (the extent
to which data is in an appropriate language and data definitions are clear), which our framework
enhances because it uses RDF to describe QualityContracts, QualityValidations for the
validation process, QualityConstraints for the validation rules, and QualityValidationResources
for the explicit implementation. Additionally, our framework incorporates existing
ontologies throughout to enhance both reusability and Ease of understanding (the extent
to which data is unambiguous and easy to understand). The baseline uses different
validation frameworks for different validation tasks. These produce different types of outcomes,
which impairs interpretability and ease of understanding. In contrast to the baseline pipeline,
our framework reports validation information consistently, so that applications can use it
to indicate the health of the pipeline and databases. The results are generated in
two formats: a machine-readable version that contains information solely about the validation
outcome, and a longer, human-readable version that presents details generated by the different
validation methods. Therefore, Representational consistency (the extent to which data
is always presented in the same format and is compatible with previous data) and Concise
presentation (the extent to which data is presented compactly without being overwhelming)
are improved over the baseline pipeline.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Accessibility Data Quality</title>
        <p>Accessibility Data Quality focuses on the extent to which data is available or obtainable. Access
security (the extent to which access to data can be restricted and therefore kept secure)
is beyond the scope of our work, as it relates more to the design of the data transformation
pipeline than to the quality of the data or the validation procedure. Nevertheless, we retrieve
validation rules from Artifacts provided by QualityValidationResources. These Artifacts can
come from multiple sources, each potentially secured by different methods. Within the pipeline,
the integrity of the rules is verified against a provided checksum before they are executed in the Guardian.
This ensures that access security is maintained. In comparison to the baseline pipeline, our
approach improves Accessibility (the extent to which data is available or easily and quickly
retrievable). By employing RDF to represent QualityContracts and QualityValidations, our
framework simplifies the querying of contracts and validation results.</p>
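        <p>The integrity check described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the text does not name the checksum algorithm, so SHA-256 is assumed here, and the function names verify_artifact and load_rules as well as the sample artifact content are hypothetical.</p>
        <preformat>
```python
import hashlib
import hmac


def verify_artifact(content: bytes, expected_sha256: str) -> bool:
    """Return True if the SHA-256 digest of the artifact matches the checksum.

    SHA-256 is an assumption; the source only speaks of "a provided checksum".
    """
    actual = hashlib.sha256(content).hexdigest()
    # compare_digest performs a constant-time comparison of the two hex strings.
    return hmac.compare_digest(actual, expected_sha256.lower())


def load_rules(content: bytes, expected_sha256: str) -> str:
    """Hand the rules to the caller only after the integrity check passes."""
    if not verify_artifact(content, expected_sha256):
        raise ValueError("checksum mismatch: artifact rejected")
    return content.decode("utf-8")


# Usage with a made-up artifact; in practice the checksum would be
# published alongside the artifact rather than computed locally.
artifact = b"ex:Shape a sh:NodeShape ."
checksum = hashlib.sha256(artifact).hexdigest()
rules = load_rules(artifact, checksum)
print(rules)
```
</preformat>
        <p>Rejecting the artifact before any rule is interpreted keeps a modified rule set from ever reaching the validation step.</p>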
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we addressed the data quality of validation methods. More specifically, we
investigated the tradeoff between the heterogeneity of validation reports and the accuracy of validation results,
and presented an answer to the research question: "How can the data quality of heterogeneous
data validation methods be improved in the context of data processing pipeline for knowledge
graphs?" We proposed a two-part framework consisting of QualityContracts and Guardians.
QualityContracts describe sets of validation rules and include the required information for
data validation in both human- and machine-readable formats. Guardians interpret the
machine-readable information, execute the validation rules, and produce comprehensive reports of both
the results and the validation process.</p>
      <p>We evaluated our approach through a comparative analysis with a processing pipeline for
knowledge graph population deployed at a European airport. By comparing the initial
version of the pipeline with the version that incorporates our framework, we observed
improvements in the intrinsic (believability), representational (e.g., clarity and conciseness), and
accessibility (ease of access) criteria of data quality, while maintaining the high
accuracy of the results produced by the different validation methods.</p>
      <p>Our framework improves criteria of data quality by relying on a software agent (Guardian)
that implements different validation methods. As a result, Guardians are inherently equipped
with specific libraries and features tailored to evaluate the specified QualityContracts.
However, this built-in setup means that Guardians possess limited flexibility to add new validation
constraints from new validation methods. This limitation necessitates regular updates of the
Guardian. Ideally, Guardians would be able to autonomously update their libraries based on
predefined criteria or settings. Additionally, while our framework has been designed with
scalability in mind, we have not yet conducted empirical scalability tests. Therefore, although we
anticipate that the system will scale effectively, this remains unverified. Furthermore, our framework
does not yet provide tools to streamline the creation of QualityContracts, QualityConstraints,
and QualityValidationResources. Consequently, significant manual input is required to define
these resources.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgements</title>
      <p>This work was funded by the Bayerisches Verbundforschungsprogramm (BayVFP) des Freistaates
Bayern through the KIWI project (grant no. DIK0318/03).</p>
    </sec>
    <sec id="sec-8">
      <title>8. Author Contributions</title>
      <p>Conceptualization, R.D., M.F., and J.F.; methodology, R.D.; software, R.D.; validation, R.D.; writing
- original draft preparation, R.D.; writing - review and editing, R.D., M.F., and J.F.; supervision,
A.H.; funding acquisition, A.H. All authors have read and agreed to the published version of the
manuscript.</p>
    </sec>
  </body>
</article>